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Abstract 

Various  machine  learning  tasks  such  as  learning  under  non-stationarity,  change  detection, 
and  dimensionality  reduction  can  be  solved  by  estimating  some  distances  between  proba¬ 
bility  distributions.  In  this  project,  we  developed  accurate  and  computationally  efficient 
methods  for  estimating  the  distances  from  data,  and  demonstrated  their  usefulness  in 
experiments. 

1  Introduction 

The  goal  of  machine  learning  is  to  find  useful  knowledge  behind  data.  Many  machine 
learning  tasks  contain  multiple  datasets  (such  as  data  taken  from  different  categories, 
different  time  periods,  etc.)  and  comparing  the  probability  distributions  behind  these 
datasets  is  a  fundamental  challenge  in  statistics  and  machine  learning  communities.  More 
specifically,  an  estimator  of  a  distance  between  probability  distributions  can  be  used  for 
solving  various  machine  learning  tasks  such  as  change  detection  in  time-series  and  semi- 
supervised  learning  under  class-balance  change.  In  this  project,  we  develop  a  unified 
framework  of  machine  learning  based  on  distances  between  probability  distributions. 

The  Kullback-Lciblcr  (KL)  distance  is  the  de-facto  standard  distance  measure  in  statis¬ 
tics  and  machine  learning,  because  of  its  high  compatibility  with  maximum  likelihood 
estimation.  However,  the  KL  distance  has  several  weaknesses  such  as  high  sensitivity  to 
outliers,  high  computational  requirements,  and  non-metricity.  In  this  project,  we  propose 
to  use  other  distances  than  the  KL  distance,  such  as  the  relative  ratio  based  distances  and 
the  difference  based  distances.  These  novel  distance  measures  can  potentially  overcome 
the  above  weaknesses  of  the  KL  distance. 
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2  Divergence  Estimation 

2 . 1  Background 

Let  us  consider  the  problem  of  approximating  a  divergence  D  between  two  probability 
distributions  P  and  P'  on  M0'  from  two  sets  of  independent  and  identically  distributed 
samples  X  :=  {cCj}f=1  and  X'  :=  {a3',}"/=1  following  P  and  P' . 

A  divergence  approximator  can  be  used  for  various  purposes  such  as  two-sample  testing 
[167,  86],  change  detection  in  time-series  [92],  class-prior  estimation  under  class-balance 
change  [45],  salient  object  detection  in  images  [214],  and  event  detection  from  movies 
[213]  and  Twitter  [112].  Furthermore,  an  approximator  of  the  divergence  between  the 
joint  distribution  and  the  product  of  marginal  distributions  can  be  used  for  solving  a 
wide  range  of  machine  learning  problems  [157],  including  independence  testing  [166], 
feature  selection  [179,  77],  feature  extraction  [178,  204],  canonical  dependency  analysis 
[89],  object  matching  [208],  independent  component  analysis  [177],  clustering  [175,  97], 
and  causal  direction  learning  [207].  For  this  reason,  accurately  approximating  a  divergence 
between  two  probability  distributions  from  their  samples  has  been  one  of  the  challenging 
research  topics  in  the  statistics,  information  theory,  and  machine  learning  communities. 

A  naive  way  to  approximate  the  divergence  from  P  to  P',  denoted  by  D(P\\P'),  is  to 
first  obtain  estimators  Px  and  P'x,  of  the  distributions  P  and  P'  separately  from  their 
samples  X  and  X',  and  then  compute  a  plug-in  approximator  D(Px\[P'Xi)-  However,  this 
naive  two-step  approach  violates  Vapnik ’s  principle  [194] : 

If  you  possess  a  restricted  amount  of  information  for  solving  some  problem, 
try  to  solve  the  problem  directly  and  never  solve  a  more  general  problem  as 
an  intermediate  step.  It  is  possible  that  the  available  information  is  sufficient 
for  a  direct  solution  but  is  insufficient  for  solving  a  more  general  intermediate 
problem. 

More  specifically,  if  we  know  the  distributions  P  and  P',  we  can  immediately  know  their 
divergence  D(P\\P').  However,  knowing  the  divergence  D(P\\P')  does  not  necessarily 
imply  knowing  the  distributions  P  and  P',  because  different  pairs  of  distributions  can 
yield  the  same  divergence  value.  Thus,  estimating  the  distributions  P  and  P’  is  more 
general  than  estimating  the  divergence  D(P\\P').  Following  Vapnik’s  principle,  direct 
divergence  approximators  D(X,X')  that  do  not  involve  the  estimation  of  distributions  P 
and  P’  have  been  developed  recently  [173,  124,  82,  212,  172], 

The  purpose  of  this  article  is  to  give  an  overview  of  the  development  of  such  direct  di¬ 
vergence  approximators.  In  Section  2.2,  we  review  the  definitions  of  the  Kullback-Leibler 
divergence,  the  Pearson  divergence,  the  relative  Pearson  divergence,  and  the  L2-distance, 
and  discuss  their  pros  and  cons.  Then,  in  Section  2.3,  we  review  direct  approxima¬ 
tors  of  these  divergences  that  do  not  involve  the  estimation  of  probability  distributions. 
In  Section  2.4,  we  show  practical  usage  of  divergence  approximators  in  unsupervised 
change- detection  in  time-series,  semi-supervised  class-prior  estimation  under  class-balance 
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change,  salient  object  detection  in  an  image,  and  evaluation  of  statistical  independence 
between  random  variables.  Finally,  we  conclude  in  Section  2.5. 

2.2  Divergence  Measures 

A  function  d(-,  •)  is  called  a  distance  if  and  only  if  the  following  four  conditions  are 
satisfied: 

•  Non- negativity:  Vx,  y,  d(x,  y)  >  0 

•  Non-degeneracy:  d(x,y )  =  0  x  —  y 

•  Symmetry:  Vx,?/,  d(x,y)  —  d(y,x) 

•  Triangle  inequality:  Vx,  y,  z  d(x,  z )  <  d(x,  y)  +  d(y,  z) 

A  divergence  is  a  pseudo- distance  that  still  acts  like  a  distance,  but  it  may  violate  some 
of  the  above  conditions.  In  this  section,  we  introduce  useful  divergence  and  distance 
measures  between  probability  distributions. 

2.2.1  Kullback-Leibler  (KL)  Divergence 

The  most  popular  divergence  measure  in  statistics  and  machine  learning  is  the  KL  diver¬ 
gence  [103]  defined  as 

KL(p||p')  :=  /  p(x)  log^-Uay 
J  P{x) 

where  p(x)  and  p'{x)  are  probability  density  functions  of  P  and  P',  respectively. 

Advantages  of  the  KL  divergence  are  that  it  is  compatible  with  maximum  likelihood 
estimation,  it  is  invariant  under  input  metric  change,  its  Riemannian  geometric  struc¬ 
ture  is  well  studied  [8],  and  it  can  be  approximated  accurately  via  direct  density-ratio 
estimation  [173,  124,  168].  However,  it  is  not  symmetric,  it  does  not  satisfy  the  triangle 
inequality,  its  approximation  is  computationally  expensive  due  to  the  log  function,  and 
it  is  sensitive  to  outliers  and  numerically  unstable  because  of  the  strong  non-linearity  of 
the  log  function  and  possible  unboundedness  of  the  density-ratio  function  p/p'  [36,  212], 

2.2.2  Pearson  (PE)  Divergence 

The  PE  divergence  [128]  is  a  squared-loss  variant  of  the  KL  divergence  defined  as 

PE(plb')  :=  f  p\x)  ~  l)  d*-  (!) 

Because  both  the  PE  and  KL  divergences  belong  to  the  class  of  Ali-Silvey-Csiszar  di¬ 
vergences  (which  is  also  known  as  /-divergences)  [5,  39],  they  share  similar  theoretical 
properties  such  as  the  invariance  under  input  metric  change. 
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The  PE  divergence  can  also  be  accurately  approximated  via  direct  density-ratio  esti¬ 
mation  in  the  same  way  as  the  KL  divergence  [82,  168].  However,  its  approximator  can 
be  obtained  analytically  in  a  computationally  much  more  efficient  manner  than  the  KL 
divergence,  because  the  quadratic  function  the  PE  divergence  adopts  is  compatible  with 
least-squares  estimation.  Furthermore,  the  PE  divergence  tends  to  be  more  robust  against 
outliers  than  the  KL  divergence  [170].  However,  other  weaknesses  of  the  KL  divergence 
such  as  asymmetry,  violation  of  the  triangle  inequality,  and  possible  unboundedness  of 
the  density-ratio  function  p/p'  remain  unsolved  in  the  PE  divergence. 

2.2.3  Relative  Pearson  (rPE)  Divergence 

To  overcome  the  possible  unboundedness  of  the  density-ratio  function  p/p',  the  rPE  di¬ 
vergence  was  recently  introduced  [212],  The  rPE  divergence  is  defined  as 

rPE(p||p')  :=  PE(p||ga) 

= ./  q°(x)  (Sk  ■ ') dx-  (2> 

where,  for  0  <  a  <  1,  qa  is  defined  as  the  a- mixture  of  p  and  p': 

qa  =  ap  +  (1  -  a)p. 

When  a  =  0,  the  rPE  divergence  is  reduced  to  the  plain  PE  divergence.  The  quantity 
p/qQ  is  called  the  relative  density  ratio ,  which  is  always  upper-bounded  by  1/a  for  a  >  0 
because 

p(x)  1  1 

qa(x)  «  +  (i_a)EM  <  «• 

Thus,  it  can  overcome  the  unboundedness  problem  of  the  PE  divergence,  while  the  in¬ 
variance  under  input  metric  change  is  still  maintained. 

The  rPE  divergence  is  still  compatible  with  least-squares  estimation,  and  it  can  be 
approximated  in  almost  the  same  way  as  the  PE  divergence  via  direct  relative  density- 
ratio  estimation  [212],  Indeed,  an  rPE-divergence  approximator  can  still  be  obtained 
analytically  in  an  accurate  and  computationally  efficient  manner.  However,  it  still  violates 
symmetry  and  the  triangle  inequality  in  the  same  way  as  the  KL  and  PE  divergence. 
Furthermore,  the  choice  of  a  may  not  be  straightforward  in  some  applications. 

2.2.4  L2-Distance 

The  L2-distance  is  another  standard  distance  measure  between  probability  distributions 
defined  as 

L2(p,p')  :=  I  (p(x)  -p\x)^j  dx. 
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The  L2-distance  is  a  proper  distance  measure,  and  thus  it  is  symmetric  and  satisfies  the 
triangle  inequality.  Furthermore,  the  density  difference  p(x)  —  p'(x)  is  always  bounded  as 
long  as  each  density  is  bounded.  Therefore,  the  L2-distance  is  stable,  without  the  need 
of  tuning  any  control  parameter  such  as  a  in  the  rPE  divergence. 

The  L2-distance  is  also  compatible  with  least-squares  estimation,  and  it  can  be  accu¬ 
rately  and  analytically  approximated  in  a  computationally  efficient  and  numerically  stable 
manner  via  direct  density -difference  estimation  [172],  However,  the  L2-distance  is  not  in¬ 
variant  under  input  metric  change,  which  is  a  unique  property  inherent  to  ratio-based 
divergences. 

2.3  Direct  Divergence  Approximation 

In  this  section,  we  review  recent  advances  in  direct  divergence  approximation. 

Suppose  that  we  are  given  two  sets  of  independent  and  identically  distributed  samples 
X  :=  {xi}f=1  and  X'  :=  {cc',}”=1  from  probability  distributions  on  with  densities  p(x) 
and  p'(x),  respectively: 


X-.=  {Xl}U  Wpix), 

Our  goal  is  to  approximate  a  divergence  between  from  p  to  p'  from  samples  X  and  X'. 

2.3.1  KL  Divergence  Approximation 

The  key  idea  of  direct  KL  divergence  approximation  is  to  estimate  the  density  ratio  p/p' 
without  estimating  the  densities  p  and  p'  [173].  More  specifically,  a  density-ratio  estimator 
is  obtained  by  minimizing  the  KL  divergence  from  p  to  r-pf  with  respect  to  a  density-ratio 
model  r,  under  the  constraints  that  the  density-ratio  function  is  non- negative  and  r  ■  p'  is 
integrated  to  one: 


min  KL(p||r  •  p) 

r 

subject  to  r  >  0  and  J  r(x)p'(x)dx  =  1. 

Its  empirical  optimization  problem,  where  an  irrelevant  constant  is  ignored  and  the  ex¬ 
pectations  are  approximated  by  the  sample  averages,  is  given  by 


1 

max  — 
r  n 


y^log  r(xj 


1=1 


subject  to  r  >  0  and  —  r(xi>)  —  1 


i’= 1 


5 


(3) 


Let  us  consider  the  following  Gaussian  density-ratio  model: 


n 

r(x)  =  9t  exp 
i=i 


X  —  Xu 

2a2 
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where 


denotes  the  £2-norni.  We  define  the  vector  of  parameters  {0^}”=1  as 


0  =  (0i, . . . ,  0n)T, 


where  T  denotes  the  transpose.  In  this  model,  the  Gaussian  kernels  are  located  on  nu¬ 
merator  samples  {xi}™=l  because  the  density  ratio  p/p'  tends  to  take  large  values  in  the 
regions  where  the  numerator  samples  {xiY/=l  exist.  Alternatively,  Gaussian  kernels  may 
be  located  on  both  numerator  and  denominator  samples,  but  this  seems  not  to  further  im¬ 
prove  the  accuracy  [173].  When  n  is  very  large,  a  (random)  subset  of  numerator  samples 
{cCj}"=1  may  be  chosen  as  Gaussian  centers,  which  can  reduce  the  computational  cost. 

For  the  Gaussian  density-ratio  model  (3),  the  above  optimization  problem  is  expressed 
as 


max  — 
9  n 


J^iog  £  6 1  exp 


i=  1 


,  £=1 


\Xj  -  Xj\ 
2a2 


subject  to  9n  >  0 


andrf£X£ 


V  exp 


i'= 1  1=1 


Ix'y-XtW2 

2a2 


=  1. 


This  is  a  convex  optimization  problem  and  thus  the  global  optimal  solution  can  be  ob¬ 
tained  easily,  e.g.,  by  gradient-projection  iterations.  Furthermore,  the  global  optimal 
solution  tends  to  be  sparse  (i.e. ,  many  parameter  values  become  exactly  zero),  which  can 
be  utilized  for  reducing  the  computational  cost. 

The  Gaussian  width  a  is  a  tuning  parameter  in  this  algorithm,  and  it  can  be  system¬ 
atically  optimized  by  cross-validation  with  respect  to  the  objective  function.  More  specif¬ 
ically,  the  numerator  samples  X  :=  (cc,}”=1  are  divided  into  T  disjoint  subsets  { Xt}J=1 
of  (approximately)  the  same  size.  Then  a  density-ratio  estimator  rt(x)  is  obtained  using 
X\Xt  and  X'  :=  {cc',}”=1  (i.e.,  all  numerator  samples  without  Xt  and  all  denominator 
samples),  and  its  objective  value  for  the  hold-out  numerator  samples  Xt  is  computed: 


1 

Xt 


logfi(aj), 

x£Xt 


where  \Xt\  denotes  the  number  of  elements  in  the  set  Xt.  This  procedure  is  repeated  for 
t  —  1, . . . ,  T,  and  the  a  value  that  maximizes  the  average  of  the  above  hold-out  objective 
values  is  chosen  as  the  best  one. 

Given  a  density-ratio  estimator  r,  a  KL-divergence  approximator  KL(A’||A’/)  can  be 
constructed  as 

—  i  n 

Kh{X\\X')  :=  -  Vlog^). 
n 

i=  1 
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Variations  of  this  procedure  for  other  density-ratio  models  have  been  developed,  in¬ 
cluding  the  log- linear  model  [187],  the  Gaussian  mixture  model  [206],  and  the  mixture  of 
probabilistic  principal  component  analyzers  [211].  Also,  an  unconstrained  variant,  which 
corresponds  to  approximately  maximizing  the  Legendre- Fenchel  lower  bound  of  the  KL 
divergence  [95],  was  proposed  [124]: 


KL(X\\X') 


max 

r 


1  x  A 

-V  log  r(xi) 

n  “ ' 


1 

n' 


ri 


J2r(x'r)  +  1 

i'= 1 


2.3.2  PE  Divergence  Approximation 

The  PE  divergence  can  also  be  directly  approximated  without  estimating  the  densities  p 
and  //  via  direct  estimation  of  the  density  ratio  p/p'  [82],  More  specifically,  a  density- 
ratio  estimator  is  obtained  by  minimizing  the  //-weighted  squared  difference  between  a 
density-ratio  model  r  and  the  true  density-ratio  function  p/p': 

min  /  p\x)  (r{x)  —  dan 

r  J  \  p'{x)J 

Its  empirical  criterion  where  an  irrelevant  constant  is  ignored  and  the  expectations  are 
approximated  by  the  sample  averages  is  given  by 


min 

r 


1 

n' 


n' 

i'= 1 


2 

71 


n 

i— 1 


For  the  Gaussian  density-ratio  model  (3)  together  with  the  /2-regularizer,  the  above 
optimization  problem  is  expressed  as 


min 

0 


dTG'6  -2GTh  +  X\\G\\2 


(4) 


where  A  >  0  denotes  the  regularization  parameter,  G '  is  the  n  x  n  matrix  with  the  (£,  £')-th 
element  defined  by 


1  n 

G^':=  ^exp 
i>= 1 


at. 


xe\\‘ 


2cr2 


exp 


X; 


Xf\ 


2a2 


and  h  is  the  n-dimensional  vector  with  the  Gth  element  defined  by 


This  is  a  convex  optimization  problem,  and  the  global  optimal  solution  can  be  computed 
analytically  as 

(G'  +  AI)-1^, 
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where  I  denotes  the  identity  matrix. 

The  Gaussian  width  o  and  the  regularization  parameter  A  are  the  tuning  parameters 
in  this  algorithm,  and  they  can  be  systematically  optimized  by  cross-validation  with 
respect  to  the  objective  function  as  follows:  First,  the  numerator  and  denominator  samples 
X  =  {xi}™=1  and  X'  =  {cc',}”,=1  are  divided  into  T  disjoint  subsets  {Xt}J=1  and  {XJ.}J=1, 
respectively.  Then  a  density-ratio  estimator  rt(x)  is  obtained  using  X\Xt  and  X'\X[  (i.e., 
all  samples  without  Xt  and  X[),  and  its  objective  value  for  the  hold-out  samples  Xt  and 
X[  is  computed: 

T^n  ~  tJm  rt(x).  (5) 

'  t' x'&xi  '  t'  xext 

This  procedure  is  repeated  for  t  —  1, . . . ,  T,  and  the  a  and  A  values  that  maximize  the 
average  of  the  above  hold-out  objective  values  are  chosen  as  the  best  ones. 

By  expanding  the  squared  term  ^ ^  —  1^  in  Eq.(l),  the  PE  divergence  can  be 
expressed  as 


Note  that  Eq.(7)  can  also  be  obtained  via  Legendre-Fenchel  convex  duality  of  the  di¬ 
vergence  functional  [140].  Based  on  these  expressions,  PE  divergence  approximators  are 
obtained  using  a  density-ratio  estimator  r  as 

—  1  n 

PE(*||  X'):=-J2nxi)~l,  (8) 

i=  1 

—  1  n'  9  n 

PE(V||V)  (9) 

n  A — '  n  z — ' 

i'=l  i= 1 

Eq.(8)  is  suitable  for  algorithmic  development  because  this  would  be  the  simplest  ex¬ 
pression,  while  Eq.(9)  is  suitable  for  theoretical  analysis  because  this  corresponds  to  the 
negative  of  the  objective  function  in  Eq.(4). 

If  the  ^-regularize!' 

n 

1=  1 


in  Eq.(4)  is  replaced  with  the  G -regularize!' 


the  solution  tends  to  be  sparse  [182] .  Then  the  solution  can  be  obtained  in  a  computation¬ 
ally  more  efficient  way  [185],  and  furthermore  a  regularization  path  tracking  algorithm 
[48]  is  available  for  efficiently  computing  solutions  with  different  regularization  parameter 
values. 


2.3.3  rPE  Divergence  Approximation 

The  rPE  divergence  can  be  directly  estimated  in  the  same  way  as  the  PE  divergence  [212]: 

min  J  qa(x')  ^r(cc)  —  y ^  dan 

Its  empirical  criterion  where  an  irrelevant  constant  is  ignored  and  the  expectations  are 
approximated  by  sample  averages  is  given  by 


min 

r 


E 

i=  1 


r2(Xj 


i'= i 


i= 1 


For  the  Gaussian  density-ratio  model  (3)  together  with  the  G-regularizer,  the  above 
optimization  problem  is  expressed  as 


min 

e 


6T(aG  +  (1 


a)G')0  —  26Th  +  A||0||2  , 


where  G  is  the  n  x  n  matrix  with  the  (£,  P)-th  element  defined  by 

1  n 

G(  f/  :=  -  Y"  exp 
n 

i= 1 

This  is  a  convex  optimization  problem,  and  the  global  optimal  solution  can  be  computed 
analytically  as 

(aG  +  (1  -  a)G'  +  A I)~lh. 


(  Haii  -  xt\\~\  (  ||a:* 

{ - 2^JeXP  { - 2a^  ) 


Cross-validation  for  tuning  the  Gaussian  width  a  and  the  regularization  parameter  A  can 
be  carried  out  in  the  same  way  as  the  PE-divergence  case,  with  Eq.(5)  replaced  by 


a 

IX 


^rt(x) 

xGXt 


1  —  a 

W 


E r,,x 

x'  GlXL 


/a2 


\Xt 


■^2rt(x). 


xdXt 


By  expanding  the  squared  term 
expressed  as 


in  Eq.(2),  the  rPE  divergence  can  be 


rPE 


p(x) 

qa(x) 


dx  —  1 


dx  +  2  [p(x)1^-dx-l. 

J  qa(x) 


(10) 

(11) 
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Based  on  these  expressions,  rPE  divergence  approximators  are  given  using  the  relative 
density-ratio  estimator  ra  as 


rPEa(A||A') 

rPEa(A||T") 


y ^rQ(Xi)  -  1, 
n 

i=  1 

— -X/afoi)2-^1  ^y^^x'i')2 +~y2ra{xi)  - 1. 

77  t  ^  77 '  (  J  77  (  J 


i=  1 


i'=l 


(12) 

(13) 


2.3.4  L2-Distance  Approximation 

The  key  idea  is  to  directly  estimate  the  density  difference  p  —  p'  without  estimating  each 
density  [172],  More  specifically,  a  density-difference  estimator  is  obtained  by  minimizing 
the  squared  difference  between  a  density-difference  model  /  and  the  true  density-difference 
function  p  —  p'\ 


min 

/ 


/ 


(/(*)  -  W*) 


Its  empirical  criterion  where  an  irrelevant  constant  is  ignored  and  the  expectation  is 
approximated  by  the  sample  average  is  given  by 


min 

/ 


f(x)2dx 


yy  /(**) 

i=  1 


2 

n' 


Let  us  consider  the  following  Gaussian  density-difference  model: 


n-\-nr 

/(*)  =  ^exp 

t= i 


X  -  ct 
2a2 


(14) 


where 


(ci,  .  .  .  ,  Cni  Cn+1,  •  •  •  ,  Cn- fn')  •  (®lj  •  •  •  ?  Xn,  aq,  •  •  •  ,  Xnt ) 
are  Gaussian  centers.  Then  the  above  optimization  problem  is  expressed  as 

[St[/S«2£tC+A||S||21 


nnn 


where  the  72-regularizer  A||£||2  is  included,  U  is  the  (n  +  n')  x  (n  +  n')  matrix  with  the 
(£,  P)-th  element  defined  by 


Ut,e>  ■=  /exp 


\X  -  Q | 

2(t2 


exp 


\X  -  Q/ 


2(t2 


da; 


=  (7ro-2)d/2exp 


|Q  —  Qq 
4a2 
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d  denotes  the  dimensionality  of  x ,  and  v  is  the  (n  +  n')-dimensional  vector  with  the  f'-th 
element  defined  by 


ve  ■  = 


1 

n 


^2  exp 

i= 1 


^  \\xj  —  q||2^ 


1 

n' 


^2  exp 

i'= 1 


This  is  a  convex  optimization  problem,  and  the  global  optimal  solution  can  be  computed 
analytically  as 

(U  +  XI^v. 


The  above  optimization  problem  is  essentially  the  same  form  as  least-squares  density- 
ratio  approximation  for  the  PE  divergence,  and  therefore  least-squares  density-difference 
approximation  can  enjoy  all  the  computational  properties  of  least-squares  density-ratio 
approximation. 

Cross-validation  for  tuning  the  Gaussian  width  a  and  the  regularization  parameter  A 
can  be  carried  as  follows:  First,  the  numerator  and  denominator  samples  X  =  {cc,;}”=1 
and  X'  =  are  divided  into  T  disjoint  subsets  {Xt}J=1  and  {X^}J=1,  respectively. 

Then  a  density-difference  estimator  ft(x)  is  obtained  using  X\Xt  and  X'\X[  (i.e. ,  all 
samples  without  Xt  and  X[),  and  its  objective  value  for  the  hold-out  samples  Xt  and  X[ 
is  computed: 

ft(x)2dx  -  yj-  J2  ft(x)  +  tJtt  J2 
'  ^  xext  '  t'  x’exi 

Note  that  the  first  term  can  be  computed  analytically  for  the  Gaussian  density-difference 
model  (14): 

f  f,(x)2dx  =  H  ul, 


where  is  the  parameter  vector  learned  from  X\Xt  and  X'\X[. 

For  an  equivalent  expression  of  the  L2-distance, 

L2(p,p')  =  j  f(x)p(x) dx-  I  f(x')p'(x')dx', 

if  /  is  replaced  with  a  density- difference  estimator  /  and  approximate  the  expectations 
by  empirical  averages,  the  following  L2-distance  approximator  can  be  obtained: 

vT£.  (15) 

Similarly,  for  another  expression 

L2(p,p)  =  j  f(x)2 dx, 
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replacing  /  with  a  density-difference  estimator  /  gives  another  L2-distance  approximator: 

iTul  (16) 

Eq.(15)  and  Eq.(16)  themselves  give  valid  approximations  to  L2(p,p'),  but  their  linear 
combination 


L2(X,X')  :=  2vT£-£ru£, 

was  shown  to  have  a  smaller  bias  than  than  Eq.(15)  and  Eq.(16). 

2.4  Usage  of  Divergence  Approximators  in  Machine  Learning 

In  this  section,  we  show  applications  of  divergence  approximators  in  machine  learning. 

2.4.1  Change-Detection  in  Time-Series 

The  goal  is  to  discover  abrupt  property  changes  behind  time-series  data.  Let  y{t)  G  Mm 
be  an  m-dimensional  time-series  sample  at  time  t,  and  let 

y(t)~  ly(t)T,  V(t  +  1)T,  ■  ■  ■ .  y(t  +  *  -  1)T]T  6  k‘” 

be  a  subsequence  of  time  series  at  time  t  with  length  k.  Instead  of  a  single  point  y(t),  the 
subsequence  Y (t)  is  treated  as  a  sample  here,  because  time-dependent  information  can 
be  naturally  incorporated  by  this  trick  [92],  Let 

y(t)  :={Y(t),Y(t  +  l),...,Y(t  +  r-l)} 

be  a  set  of  r  retrospective  subsequence  samples  starting  at  time  t.  Then  a  divergence 
between  the  probability  distributions  of  y(t)  and  y(t  +  r )  may  be  used  as  the  plausibility 
of  change  points  (see  Figure  1). 

In  Figure  2,  we  illustrate  results  of  unsupervised  change  detection  for  the  IPSJ  SIG- 
SLP  Corpora  and  Environments  for  Noisy  Speech  Recognition  (CENSREC)  dataset1  that 
records  human  voice  in  noisy  environments  such  as  a  restaurant,  and  the  Human  Activ¬ 
ity  Sensing  Consortium  (HASC)  challenge  2011  dataset2  that  provides  human  activity 
information  collected  by  portable  three-axis  accelerometers.  These  graphs  show  that  the 
KL-based  method  is  excessively  sensitive  to  noise  and  thus  change  points  are  not  clearly 
detected.  On  the  other  hand,  the  L2-based  method  more  clearly  indicates  the  existence 
of  change  points. 

It  was  also  demonstrated  that  divergence-based  change-detection  methods  are  useful 
in  event  detection  from  movies  [213]  and  Twitter  [112]. 


1h.ttp : / / research .nii.ac.jp/ src/ en/CENSREC- 1-C . html 

2http : / /hasc . jp/hc2011/ 


12 


r< 


'  Y(t)  ]©©'©(  Irb  P1E1I  Y(t  +  r) 

yu  +  nl©~©@i  PH  mi  y(t  +  r  + 1) 


I©  ©  @(  ETM \i]\Y(t  +  2r  -  1) 

T(t)  y{t  +  r) 


Figure  1:  Schematic  of  change-point  detection  in  time-series. 


2.4.2  Salient  Object  Detection  in  an  Image 

The  goal  is  to  find  salient  objects  in  an  image.  This  can  be  achieved  by  computing  a 
divergence  between  the  probability  distributions  of  image  features  (such  as  brightness, 
edges,  and  colors)  in  the  center  window  and  its  surroundings  [214],  This  divergence 
computation  is  swept  over  the  entire  image  with  changing  scales  (Figure  3). 

The  object  detection  results  on  the  MSRA  salient  object  database  [114]  by  the  rPE 
divergence  with  a  =  0.1  are  described  in  Figure  4,  where  pixels  in  gray-scale  saliency 
maps  take  brighter  color  if  the  estimated  divergence  value  is  large.  The  results  show  that 
visually  salient  objects  can  be  successfully  detected  by  the  divergence-based  approach. 

2.4.3  Measuring  Statistical  Independence 

The  goal  is  to  measure  how  strongly  two  random  variables  U  and  V  are  statistically 
dependent,  from  paired  samples  {(itj,  Uj)}"=  i  drawn  independently  from  the  joint  distri¬ 
bution  with  density  pu ,v(u,v).  Let  us  consider  a  divergence  between  the  joint  density 
pu.v  and  the  product  of  marginal  densities  p u  ■  pv-  This  actually  serves  as  a  measure  of 
statistical  independence,  because  U  and  V  are  independent  if  and  only  if  the  divergence 
is  zero  (i.e. ,  pu,v  =  Pu  ■  Pv),  and  the  dependence  between  U  and  V  is  stronger  if  the 
divergence  is  larger. 

Such  a  dependence  measure  can  be  approximated  in  the  same  way  as  ordinary  diver¬ 
gences  by  using  the  two  datasets  formed  as  X  =  {(it*,  Uj)}"=1  and  X'  =  {(it*,  Vj)}^=1. 
The  dependence  measure  based  on  the  KL  divergence  is  called  mutual  information  (MI) 
[147]: 


MI  :  = 


Pu,v(w,  v)  log 


pUy{u,V) 

pu(u)pv(v 


-dudu. 


MI  plays  a  central  role  in  information  theory  [38]. 

On  the  other  hand,  its  PE-divergence  variant  is  called  the  squared-loss  mutual  infor- 
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Figure  2:  Results  of  change-point  detection.  Original  time-series  data  is  plotted  in  the  top 
graphs,  and  change  scores  obtained  by  KLIEP  (Section  2.3.1)  and  LSDD  (Section  2.3.4) 
are  plotted  in  the  bottom  graphs. 
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Figure  3:  Schematic  of  salient  object  detection  in  an  image. 


mation  (SMI): 


SMI  :=  If  pu(u)pv(v)  f  PU,V|'^;  -  l)  dudv. 

JJ  \pu(u)pv(v)  ) 

SMI  is  useful  for  solving  various  machine  learning  tasks  [157],  including  independence  test¬ 
ing  [166],  feature  selection  [179,  77],  feature  extraction  [178,  204],  canonical  dependency 
analysis  [89],  object  matching  [208],  independent  component  analysis  [177],  clustering 
[175,  97],  and  causal  direction  estimation  [207]. 

An  L2-clistance  variant  of  the  dependence  measure  is  called  quadratic  mutual  infor¬ 
mation  (QMI)  [186]: 


QMI  :=  IJ  (pupils,,  v)  -  pu(u)pv(v)^J  dwdu. 

QMI  is  also  a  useful  dependence  measure  in  practice  [143]. 

2.5  Conclusions 

In  this  article,  we  reviewed  recent  advances  in  direct  divergence  approximation.  Direct 
divergence  approximators  theoretically  achieve  optimal  convergence  rates  both  in  para¬ 
metric  and  non-parametric  cases  and  experimentally  compare  favorably  with  the  naive 
density-estimation  counterparts  [124,  173,  82,  212,  172], 

However,  direct  divergence  approximators  still  suffer  from  the  curse  of  dimensionality. 
A  possible  cure  for  this  problem  is  to  combine  them  with  dimensionality  reduction,  based 
on  the  hope  that  two  probability  distributions  share  some  commonality  [159,  176,  209]. 
Further  investigating  this  line  would  be  a  promising  future  direction. 
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Figure  4:  Results  of  salient  object  detection  in  an  image.  Upper:  Original  images.  Lower: 
Obtained  saliency  maps  (brighter  color  means  more  salient). 
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Change  Detection 

3 . 1  Background 

Changes  in  interactions  between  random  variables  are  interesting  in  many  real-world  phe¬ 
nomena.  For  example,  genes  may  interact  with  each  other  in  different  ways  when  external 
stimuli  change,  co-occurrence  between  words  may  appear /disappear  when  the  domains  of 
text  corpora  shift,  and  correlation  among  pixels  may  change  when  a  surveillance  camera 
captures  anomalous  activities.  Discovering  such  changes  in  interactions  is  a  task  of  great 
interest  in  machine  learning  and  data  mining,  because  it  provides  useful  insights  into 
underlying  mechanisms  in  many  real- world  applications. 

In  this  paper,  we  consider  the  problem  of  detecting  changes  in  conditional  indepen¬ 
dence  among  random  variables  between  two  sets  of  data.  Such  conditional  independence 
structure  can  be  expressed  via  an  undirected  graphical  model  called  a  Markov  network 
(MN)  [23,  198,  98],  where  nodes  and  edges  represent  variables  and  their  conditional  de¬ 
pendencies,  respectively.  As  a  simple  and  widely  applicable  case,  the  pairwise  MN  model 
has  been  thoroughly  studied  recently  [136,  105].  Following  this  line,  we  also  focus  on  the 
pairwise  MN  model  as  a  representative  example. 

A  naive  approach  to  change  detection  in  MNs  is  the  two-step  procedure  of  first  es¬ 
timating  two  MNs  separately  from  two  sets  of  data  by  maximum  likelihood  estimation 
(MLE),  and  then  comparing  the  structure  of  the  learned  MNs.  However,  MLE  is  often 
computationally  intractable  due  to  the  normalization  factor  included  in  the  density  model. 
Therefore,  Gaussianity  is  often  assumed  in  practice  for  computing  the  normalization  fac¬ 
tor  analytically  [70],  though  this  Gaussian  assumption  is  highly  restrictive  in  practice.  We 
may  utilize  importance  sampling  [139]  to  numerically  compute  the  normalization  factor, 
but  an  inappropriate  choice  of  the  instrumental  distribution  may  lead  to  an  estimate  with 
high  variance  [201];  for  more  discussions  on  sampling  techniques,  see  references  [56]  and 
[72],  References  [75]  and  [63]  have  explored  an  alternative  approach  to  avoid  computing 
the  normalization  factor  which  are  not  based  on  MLE. 

However,  the  two-step  procedure  has  a  conceptual  weakness  that  structure  change  is 
not  directly  learned.  This  indirect  nature  causes  a  crucial  problem:  Suppose  that  we 
want  to  learn  a  sparse  structure  change.  For  learning  sparse  changes,  we  may  utilize  t\- 
regularized  MLE  [16,  52,  105],  which  produces  sparse  MNs  and  thus  the  change  between 
MNs  also  becomes  sparse.  However,  this  approach  does  not  work  if  each  MN  is  dense  but 
only  change  is  sparse. 

To  mitigate  this  indirect  nature,  the  fused-lasso  [183]  is  useful,  where  two  MNs  are 
simultaneously  learned  with  a  sparsity-inducing  penalty  on  the  difference  between  two  MN 
parameters  [218].  Although  this  fused-lasso  approach  allows  us  to  learn  sparse  structure 
change  naturally,  the  restrictive  Gaussian  assumption  is  still  necessary  to  obtain  the 
solution  in  a  computationally  tractable  way. 

The  nonparanormal  assumption  [111,  110]  is  a  useful  generalization  of  the  Gaussian 
assumption.  A  nonparanormal  distribution  is  a  semi- parametric  Gaussian  copula  where 
each  Gaussian  variable  is  transformed  by  a  monotone  non-linear  function.  Nonparanormal 
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Figure  5:  The  rationale  of  direct  structural  change  learning:  finding  the  difference  between 
two  MNs  is  a  more  specific  task  than  finding  the  entire  structures  of  those  two  networks, 
and  hence  should  be  possible  to  learn  with  less  data. 

distributions  are  much  more  flexible  than  Gaussian  distributions  thanks  to  the  feature- 
wise  non-linear  transformation,  while  the  normalization  factors  can  still  be  computed 
analytically.  Thus,  the  fused-lasso  method  combined  with  nonparanormal  models  would 
be  one  of  the  state-of-the-art  approaches  to  change  detection  in  MNs.  However,  the  fused- 
lasso  method  is  still  based  on  separate  modeling  of  two  MNs,  and  its  computation  for  more 
general  non-Gaussian  distributions  is  challenging. 

In  this  paper,  we  propose  a  more  direct  approach  to  structural  change  learning  in 
MNs  based  on  density  ratio  estimation  (DRE)  [169].  Our  method  does  not  separately 
model  two  MNs,  but  directly  models  the  change  in  two  MNs.  This  idea  follows  Vapnik’s 
principle  [195]: 

If  you  possess  a  restricted  amount  of  information  for  solving  some  problem, 
try  to  solve  the  problem  directly  and  never  solve  a  more  general  problem  as 
an  intermediate  step.  It  is  possible  that  the  available  information  is  sufficient 
for  a  direct  solution  but  is  insufficient  for  solving  a  more  general  intermediate 
problem. 

This  principle  was  used  in  the  development  of  support  vector  machines  (SVMs):  rather 
than  modeling  two  classes  of  samples,  SVM  directly  learns  a  decision  boundary  that  is 
sufficient  for  performing  pattern  recognition.  In  the  current  context,  estimating  two  MNs 
is  more  general  than  detecting  changes  in  MNs  (Figure  5).  By  directly  detecting  changes 
in  MNs,  we  can  also  halve  the  number  of  parameters,  from  two  MNs  to  one  MN-clifference. 

Another  important  advantage  of  our  DRE-based  method  is  that  the  normalization 
factor  can  be  approximated  efficiently,  because  the  normalization  term  in  a  density  ratio 
function  takes  the  form  of  the  expectation  over  a  data  distribution  and  thus  it  can  be 
simply  approximated  by  the  sample  average  without  additional  sampling.  Through  ex¬ 
periments  on  gene  expression  and  Twitter  data  analysis,  we  demonstrate  the  usefulness 
of  our  proposed  approach. 

The  remainder  of  this  paper  is  structured  as  follows.  In  Section  3.2,  we  formulate  the 
problem  of  detecting  structural  changes  and  review  currently  available  approaches.  We 
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then  propose  our  DRE-based  structural  change  detection  method  in  Section  3.3.  Results 
of  illustrative  and  real-world  experiments  are  reported  in  Section  3.4  and  Section  3.5, 
respectively.  Finally,  we  conclude  our  work  and  show  the  future  direction  in  Section  3.6. 

3.2  Problem  Formulation  and  Related  Methods 

In  this  section,  we  formulate  the  problem  of  change  detection  in  Markov  network  structure 
and  review  existing  approaches. 

3.2.1  Problem  Formulation 

Consider  two  sets  of  independent  samples  drawn  separately  from  two  probability  distri¬ 
butions  P  and  Q  on 


{«f>K  P  {«?>»  Q- 

We  assume  that  P  and  Q  belong  to  the  family  of  Markov  networks  (MNs)  consisting  of 
univariate  and  bivariate  factors3,  i.e.,  their  respective  probability  densities  p  and  q  are 
expressed  as 


P(x]6) 


Z(0) 


exp  Opvf(x 


{u) 


\u,v=l,u>v 


(17) 


where  x  =  (a;*-1), . . . ,  a;^)T  is  the  d-dimensional  random  variable,  T  denotes  the  transpose, 
9U  V  is  the  parameter  vector  for  the  elements  and  x^v\  and 

0  =  •  •  •  )  ^d,1^2,2J  '  '  '  i  0d,2i  '  ■  ■  i  Od,d)T 

is  the  entire  parameter  vector.  f{x^u\x^)  is  a  bivariate  vector- valued  basis  function. 
Z{6)  is  the  normalization  factor  defined  as 

z(0)=fex P(  Y1  KJ(x(")’xMndx- 

\u,v=l,u>v  / 


q(x;  6)  is  defined  in  the  same  way. 

Given  two  densities  which  can  be  parameterized  using  p(x;  6P)  and  q{x\  0®),  our  goal 
is  to  discover  the  changes  in  parameters  from  P  to  Q ,  i.e.,  0P  —  0® . 


3Note  that  the  proposed  algorithm  itself  can  be  applied  to  any  MNs  containing  more  than  two  elements 
in  each  factor. 
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3.2.2  Sparse  Maximum  Likelihood  Estimation  and  Graphical  Lasso 

Maximum  likelihood  estimation  (MLE)  with  group  l \  -regularization  has  been  widely  used 
for  estimating  the  sparse  structure  of  MNs  [144,  136,  105]: 


max 

e 


1 

Up 


np  d 

Y  logp(*f ;  0)  -  A  Y 


i=  1 


u,v=l,u>v 


(18) 


where  ||  •  ||  denotes  the  £2-norm-  As  A  increases,  ||0U,„||  may  drop  to  0.  Thus,  this  method 
favors  an  MN  that  encodes  more  conditional  independencies  among  variables. 

Computation  of  the  normalization  term  Z(6)  in  Eq.(lT)  is  often  computationally  in¬ 
tractable  when  the  dimensionality  of  x  is  high.  To  avoid  this  computational  problem,  the 
Gaussian  assumption  is  often  imposed  [52,  119].  More  specifically,  the  following  zero-mean 
Gaussian  model  is  used: 


p(*;0)  =  d(2^A  exp  (-^T0:e)  ’ 

where  ©  is  the  inverse  covariance  matrix  (a.k.a.  the  precision  matrix)  and  det(-)  denotes 
the  determinant.  Then  ©  is  learned  as 

max  [log det(©)  -  tr(©Sp)  -  A||0||i]  , 

where  Sp  is  the  sample  covariance  matrix  of  {x[ }”=1.  ||©||i  is  the  G-norrn  of  ©,  i.e.,  the 
absolute  sum  of  all  elements.  This  formulation  has  been  studied  intensively  in  [16],  and  a 
computationally  efficient  algorithm  called  the  graphical  lasso  (Glasso)  has  been  proposed 

[52]. 

Sparse  changes  in  conditional  independence  structure  between  P  and  Q  can  be  de¬ 
tected  by  comparing  two  MNs  estimated  separately  using  sparse  MLE.  However,  this 
approach  implicitly  assumes  that  two  MNs  are  sparse,  which  is  not  necessarily  true  even 
if  the  change  is  sparse. 


3.2.3  Fused-Lasso  (Flasso)  Method 

To  more  naturally  handle  sparse  changes  in  conditional  independence  structure  between 
P  and  Q,  a  method  based  on  fused-lasso  [183]  has  been  developed  [218].  This  method 
directly  sparsihes  the  difference  between  parameters. 

The  original  method  conducts  feature-wise  neighborhood  regression  [119]  jointly  for 
P  and  Q,  which  can  be  conceptually  understood  as  maximizing  the  local  conditional 
Gaussian  likelihood  jointly  on  each  feature  [136].  A  slightly  more  general  form  of  the 
learning  criterion  may  be  summarized  as 


max  [ff(0  +*?(*?)  -  AidlClIi  +  ||0?lh 

0P,e? 


Aoll  6> 


lJ  ) 
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where  tp  ( 6 )  is  the  log  conditional  likelihood  for  the  s-th  element  G  M  given  the  rest 
x^~s^  G 


nP 

€(0)  =  — ^2^gp(x[s)p\x(rs)p;G). 
np  “ 

1=1 

0.^(0)  is  dehned  in  the  same  way  as  £f(0). 

Since  the  Flasso-based  method  directly  sparsities  the  change  in  MN  structure,  it  can 
work  well  even  when  each  MN  is  not  sparse.  However,  using  other  models  than  Gaussian 
is  difficult  because  of  the  normalization  issue  described  in  Section  3.2.2. 

3.2.4  Nonparanormal  Extensions 

In  the  above  methods,  Gaussianity  is  required  in  practice  to  compute  the  normalization 
factor  efficiently,  which  is  a  highly  restrictive  assumption.  To  overcome  this  restriction, 
it  has  become  popular  to  perform  structure  learning  under  the  nonparanormal  settings 
[111,  110],  where  the  Gaussian  distribution  is  replaced  by  a  semi-parametric  Gaussian 
copula. 

A  random  vector  x  =  (V1), . . . ,  is  said  to  follow  a  nonparanormal  distribu¬ 

tion,  if  there  exists  a  set  of  monotone  and  differentiable  functions,  {/q(a;)}f=1,  such  that 
h{x)  =  (hi(x(1)), ...  ,hd(  x G1))T  follows  the  Gaussian  distribution.  Nonparanormal  dis¬ 
tributions  are  much  more  flexible  than  Gaussian  distributions  thanks  to  the  non-linear 
transformation  {hi(x)}f=1,  while  the  normalization  factors  can  still  be  computed  in  an 
analytical  way. 

However,  the  nonparanormal  transformation  is  restricted  to  be  element-wise,  which  is 
still  restrictive  to  express  complex  distributions. 

3.2.5  Maximum  Likelihood  Estimation  for  Non-Gaussian  Models  by 
Importance-Sampling 

A  numerical  way  to  obtain  the  MLE  solution  under  general  non-Gaussian  distributions  is 
importance  sampling. 

Suppose  that  we  try  to  maximize  the  log-likelihood4: 

nP 

4ile(0)  =  —  lo§ p(x( ; 

npU 

=  ~  6n,vf(4U)P ,  XiV)P)  -  l°g  [  eXP  6lvf(x{U\x(V))]  dax  (19) 

P  i=  1  u>v  ^  \u>v  / 

The  key  idea  of  importance  sampling  is  to  compute  the  integral  by  the  expectation 
over  an  easy-to-sample  instrumental  density  p'(x)  (e.g.,  Gaussian)  weighted  according  to 


4 From  here  on,  we  simplify  EZ,v=i,u>v  as  Eu>v 
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the  importance  l/p'(x).  More  specifically,  using  i.i.d.  samples  {*'}  ™'=l  p'(x),  the  last 

term  of  Eq.(19)  can  be  approximately  computed  as  follows: 


J  ,  f  ,,  ,eXP  &u>vdivf(xiU),x(V)))  A 

dx  =  log  p  (x) - = - ; — r - da? 

J  p'{x) 

1  ^  exp  (j2u>vdlvf(4U\4V))) 

~  iog — y.  — — 

71'  - J 


2=1 


p'W) 


We  refer  to  this  implementation  of  Glasso  as  IS-Glasso  below. 

However,  importance  sampling  tends  to  produce  an  estimate  with  large  variance  if  the 
instrumental  distribution  is  not  carefully  chosen.  Although  it  is  often  suggested  to  use  a 
density  whose  shape  is  similar  to  the  function  to  be  integrated  but  with  thicker  tails  as 
p ',  it  is  not  straightforward  in  practice  to  decide  which  p'  to  choose,  especially  when  the 
dimensionality  of  x  is  high  [201]. 

We  can  also  consider  an  importance-sampling  version  of  the  Flasso  method  (which  we 
refer  to  as  IS-Flasso)5 


Ce(«P)  +  «mle(«®)  -  M||9P||2  +  l|0QlD  -  A,  Y.  IlC  -  ClI  . 

U>  V 

where  both  ^mle(^P)  and  ^le(0°)  are  approximated  by  importance  sampling  for  non- 
Gaussian  distributions.  However,  in  the  same  way  as  IS-Glasso,  the  choice  of  instrumental 
distributions  is  not  straightforward. 


max 

eppQ 


3.3  Direct  Learning  of  Structural  Changes  via  Density  Ratio 
Estimation 

The  Flasso  method  can  more  naturally  handle  sparse  changes  in  MNs  than  separate  sparse 
MLE.  However,  the  Flasso  method  is  still  based  on  separate  modeling  of  two  MNs,  and 
its  computation  for  general  high-dimensional  non-Gaussian  distributions  is  challenging. 
In  this  section,  we  propose  to  directly  learn  structural  changes  based  on  density  ratio 
estimation  [169].  Our  approach  does  not  involve  separate  modeling  of  each  MN  and 
allows  us  to  approximate  the  normalization  term  efficiently  for  any  distributions. 

3.3.1  Density  Ratio  Formulation  for  Structural  Change  Detection 

Our  key  idea  is  to  consider  the  ratio  of  p  and  q: 


p{x]Op) 
q{x ;  6Q) 


eQ  ' 

U,V/ 


T 


f(x 


O)  x(v) 


5For  implementation  simplicity,  we  maximize  the  joint  likelihood  of  p  and  q,  instead  of  its  feature- wise 
conditional  likelihood.  We  also  switch  the  first  penalty  term  from  H  to  G- 
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Here  Kv  ~  6U,V  encodes  the  difference  between  P  and  Q  for  factor  f(x^u\x^),  i.e., 
0PV  —  0®v  is  zero  if  there  is  no  change  in  the  factor  f  (x^u\  x^) . 

Once  we  consider  the  ratio  of  p  and  q,  we  actually  do  not  have  to  estimate  6PV  and 
0®v]  instead  estimating  their  difference  6U  V  =  6PV  —  6cjv  is  sufficient  for  change  detection: 


where 


r(x\  6)  = 


N(0) 


\u>v  / 


N(0)  =  /  9(»exp  (  ^2e^,J(xiu\x{v))  )  dx. 


v u>v 


(20) 


The  normalization  term  N{6)  guarantees6 


/9(x  )r(-;»)d-  =  l. 


Thus,  in  this  density  ratio  formulation,  we  are  no  longer  modeling  p  and  q  separately, 
but  we  model  the  change  from  p  to  q  directly.  This  direct  nature  would  be  more  suitable 
for  change  detection  purposes  according  to  Vapnik’s  principle  that  encourages  avoidance 
of  solving  more  general  problems  as  an  intermediate  step  [195].  This  direct  formulation 
also  allows  us  to  halve  the  number  of  parameters  from  both  9P  and  6Ci  to  only  6. 

Furthermore,  the  normalization  factor  N(6)  in  the  density  ratio  formulation  can  be 

easily  approximated  by  the  sample  average  over  q(x),  because  N(0)  is  the 

6If  the  model  q(x;0 Q)  is  correctly  specified,  i.e.,  there  exists  9®*  such  that  q(x:0®*)  =  q(x),  then 
N(9)  can  be  interpreted  as  importance  sampling  of  Z(9P)  via  instrumental  distribution  q(x).  Indeed, 
since 


Z(«  >  =  /«(*) - ~,(X,99-) - dx • 


where  q(x:  9 ®*)  =  q(x),  we  have 


N(0P  -  0Q*) 


Z(9P) 

Z{9Q*) 


=  J  q(x)ex p 


( U ) 

J 


da;. 


This  is  exactly  the  normalization  term  N(0)  of  the  ratio  p(x;  9p)/q(x\  9®*).  However,  we  note  that  the 
density  ratio  estimation  method  we  use  in  this  paper  is  consistent  to  the  optimal  solution  in  the  model 
even  without  the  correct  model  assumption  [85].  An  alternative  normalization  term, 

N'{9,9q)  =  J  q{x;  9Q)r{x;  9)dx, 

may  also  be  considered,  as  in  the  case  of  MLE.  However,  this  alternative  form  requires  an  extra  parameter 
9®  which  is  not  our  main  interest. 
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expectation  over  q(x): 


N(0)  ~  —  exp  1^2  Ou,Vf(xiU)Q,x?)Q)  )  • 

^  1=1  \U>V  / 


3.3.2  Direct  Density-Ratio  Estimation 

Density  ratio  estimation  has  been  recently  introduced  to  the  machine  learning  community 
and  is  proven  to  be  useful  in  a  wide  range  of  applications  [169].  Here,  we  concentrate  on 
the  density  ratio  estimator  called  the  Kullback-Leibler  importance  estimation  procedure 
(KLIEP)  for  log-linear  models  [174,  188]. 

For  a  density  ratio  model  r(cc;  0),  the  KLIEP  method  minimizes  the  Kullback-Leibler 
divergence  from  p(x)  to  p{x)  =  q(x)r(x ;  0): 

KL[p||p]  =  [ p(x)  log  da; 

J  q{x)r(x\0) 

=  Const.  —  J  p{x)  log  r(x;  Q)dx.  (21) 

Note  that  our  density-ratio  model  (20)  automatically  satisfies  the  non- negativity  and 
normalization  constraints: 

r(x;0)  >  0  and  J  q(x)r(x;d)dx  =  1. 

In  practice,  we  maximize  the  empirical  approximation  of  the  second  term  in  Eq.(21): 


(6>)  =  —  log  r(xi> ;0) 

T)  L - * 


np  ^ 
1=1 


np  ^ 

1=1  U>V 


-  loS  (  5^exP  (  du,vf(XiU)Q’XiV)Q )  )  )  • 

\  l=1  \u>V  J  J 

Because  6kliep(0)  is  concave  with  respect  to  0,  its  global  maximizer  can  be  numeri¬ 
cally  found  by  standard  optimization  techniques  such  as  gradient  ascent  or  quasi-Newton 
methods.  The  gradient  of  6kliep  with  respect  to  Qu.v  is  given  by 


9u,v  ^KLIEP  (0)  =- -  ^2f{x<ia)P,  xjV)P) 

np  “ 

1=1 


^  Ei=i  exP  ( du',v'f(xi 


W)Q  ( v')Q 


))  f(xPQ,xPQ) 


nQ  l^j= 1  e  y/_^u">v"  uu",v"J  \xj  )  J 

which  can  be  computed  in  a  straightforward  manner  for  any  feature  vector  f{x^u\x^). 
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Dual 


Figure  6:  Schematics  of  primal  and  dual  optimization,  b  denotes  the  number  of  basis 
functions  and  T  denotes  the  number  of  factors.  Because  we  are  considering  pairwise 
factors,  T  =  0(d 2)  for  input  dimensionality  d. 


3.3.3  Sparsity-Inducing  Norm 

To  find  a  sparse  change  between  P  and  Q,  we  propose  to  regularize  the  KLIEP  solution 
with  a  sparsity-inducing  norm  J2U>V  f|0u,i>||-  Note  that  the  MLE  approach  sparsihes  both 
9P  and  9 ®  so  that  the  difference  9P  —  6®  is  also  sparsihed,  while  we  directly  sparsify  the 
difference  9P  —  9®]  thus  our  method  can  still  work  well  even  if  6P  and  6 ®  are  dense. 

In  practice,  we  may  use  the  following  elastic-net  penalty  [222]  to  better  control  over¬ 
fitting  to  noisy  data: 


max 

e 


^KLIEp(0)  —  Ai||0||2  —  A2  || 0u,v 


U>  V 


where  |d||2  penalizes  the  magnitude  of  the  entire  parameter  vector. 


(22) 


3.3.4  Dual  Formulation  for  High-Dimensional  Data 

The  solution  of  the  optimization  problem  (22)  can  be  easily  obtained  by  standard  sparse 
optimization  methods.  However,  in  the  case  where  the  input  dimensionality  d  is  high 
(which  is  often  the  case  in  our  setup),  the  dimensionality  of  parameter  vector  9  is  large, 
and  thus  obtaining  the  solution  can  be  computationally  expensive.  Here,  we  derive  a  dual 
optimization  problem  [27],  which  can  be  solved  more  efficiently  for  high-dimensional  9 
(Figure  6). 
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As  detailed  in  Appendix,  the  dual  optimization  problem  is  given  as 


where 


nQ 

min  22  ai  l°g  ai  + 

a=(ai,...,anQ)T  ,=1 


1 

x; 


Y,  max(0,  ||£u,„ 

U>V 


nQ 

subject  to  Qi, . . . ,  anQ  >  0  and  cq  =  1, 

1=1 


(23) 


^ U,V  Qu,V 
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The  primal  solution  can  be  obtained  from  the  dual  solution  as 


if  lie^ll  >  A2, 

0U,V=  <  Al  V  Uu,vW  J 

jo  if  If <  \2. 


(24) 


Note  that  the  dimensionality  of  the  dual  variable  ex.  is  equal  to  rig,  while  that  of 
9  is  quadratic  with  respect  to  the  input  dimensionality  d,  because  we  are  considering 
pairwise  factors.  Thus,  if  d  is  not  small  and  Uq  is  not  very  large  (which  is  often  the 
case  in  our  experiments  shown  later),  solving  the  dual  optimization  problem  would  be 
computationally  more  efficient.  Furthermore,  the  dual  objective  (and  its  gradient)  can  be 
computed  efficiently  in  parallel  for  each  (u,v),  which  is  a  useful  property  when  handling 
large-scale  MNs.  Note  that  the  dual  objective  is  differentiable  everywhere,  while  the 
primal  objective  is  not. 


3.4  Numerical  Experiments 

In  this  section,  we  compare  the  performance  of  the  proposed  KLIEP-based  method,  the 
Flasso  method,  and  the  Glasso  method  for  Gaussian  models,  nonparanormal  models,  and 
non-Gaussian  models.  Results  are  reported  on  datasets  with  three  different  underlying 
distributions:  multivariate  Gaussian,  nonparanormal,  and  non-Gaussian  “diamond”  dis¬ 
tributions.  We  also  investigate  the  computation  time  of  the  primal  and  dual  formulations 
as  a  function  of  the  input  dimensionality. 

3.4.1  Gaussian  Distribution 

First,  we  investigate  the  performance  of  each  method  under  Gaussianity. 

Consider  a  40-node  sparse  Gaussian  MN,  where  its  graphical  structure  is  characterized 
by  precision  matrix  ©p  with  diagonal  elements  equal  to  2.  The  off-diagonal  elements  are 
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randomly  chosen7  and  set  to  0.2,  so  that  the  overall  sparsity  of  ©p  is  25%.  We  then 
introduce  changes  by  randomly  picking  15  edges  and  reducing  the  corresponding  elements 
in  the  precision  matrix  by  0.1.  The  resulting  precision  matrices  ©p  and  ©^  are  used  for 
drawing  samples  as 

{zf}?_VVdW(0,(©p-‘)  and  {*?}£  ‘i?  V(0,  (©«)-■), 


where  A f(n,  S)  denotes  the  multivariate  normal  distribution  with  mean  /i  and  covariance 
matrix  £.  Datasets  of  size  n  =  rip  =  uq  =  50, 100  are  tested. 

We  compare  the  performance  of  the  KL1EP,  Flasso,  and  Glasso  methods.  Because  all 
methods  use  the  same  Gaussian  model,  the  difference  in  performance  is  caused  only  by 
the  difference  in  estimation  methods.  We  repeat  the  experiments  20  times  with  randomly 
generated  datasets  and  report  the  results  in  Figure  7. 

The  top  6  graphs  are  examples  of  regularization  paths8.  The  dashed  lines  represent 
changed  edges  in  the  ground  truth,  while  the  solid  lines  represent  unchanged  edges.  The 
top  row  is  for  n  =  100  while  the  middle  row  is  for  n  =  50.  The  bottom  3  graphs  are 
the  data  generating  distribution  and  averaged  precision-recall  (P-R)  curves  with  standard 
error  over  20  runs.  The  P-R  curves  are  plotted  by  varying  the  group-sparsity  control 
parameter  A2  with  Ai  =  0  in  KLIEP  and  Flasso,  and  by  varying  the  sparsity  control 
parameters  as  A  =  Ap  =  A^  in  Glasso. 

In  the  regularization  path  plots,  solid  vertical  lines  show  the  regularization  parameter 
values  picked  based  on  hold-out  data  {Sf  P  and  {xf}fA\°  Q  as  follows: 

•  KLIEP:  The  hold-out  log-likelihood  (HOLL)  is  maximized: 
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•  Glasso:  The  sum  of  HOLLs  for  p(x;  6)  and  q(x ;  6)  is  maximized: 
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7We  set  0Mi„  =  &VlU  for  not  breaking  the  symmetry  of  the  precision  matrix. 

8Paths  of  univariate  factors  are  omitted  for  clear  visibility. 
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When  n  =  100,  KLIEP  and  Flasso  clearly  distinguish  changed  (dashed  lines)  and 
unchanged  (solid  lines)  edges  in  terms  of  parameter  magnitude.  However,  when  the 
sample  size  is  halved  to  n  =  50,  the  separation  is  visually  rather  unclear  in  the  case  of 
Flasso.  In  contrast,  the  paths  of  changed  and  unchanged  edges  are  still  almost  disjoint  in 
the  case  of  KLIEP.  The  Glasso  method  performs  rather  poorly  in  both  cases.  A  similar 
tendency  can  be  observed  also  in  the  P-R  curve  plot:  When  the  sample  size  is  n  =  100, 
KLIEP  and  Flasso  work  equally  well,  but  KLIEP  gains  its  lead  when  the  sample  size  is 
reduced  to  n  =  50.  Glasso  does  not  perform  well  in  both  cases. 

3.4.2  Nonparanormal  Distribution 

We  post-process  the  Gaussian  dataset  used  in  Section  3.4.1  to  construct  nonparanormal 
samples.  More  specifically,  we  apply  the  power  function, 

h;\x)  =  sign(x)|x|  2, 

to  each  dimension  of  xp  and  xQ,  so  that  h(xp)  ~  J\T{ 0,  (0P)_1)  and  h(xQ)  ~ 

A/^O,^)-1). 

To  cope  with  the  non-linearity  in  the  KLIEP  method,  we  use  the  power  nonparanormal 
basis  functions  with  power  k  =  2,  3,  and  4: 

f{xi,Xj )  =  (sign(xj)|xi|fe,  sign(xj)|xj|fe,  1)T. 

Model  selection  of  k  is  performed  together  with  the  regularization  parameter  by  HOLL 
maximization.  For  Flasso  and  Glasso,  we  apply  the  nonparanormal  transform  as  described 
in  [111]  before  the  structural  change  is  learned. 

The  experiments  are  conducted  on  20  randomly  generated  datasets  with  n  =  50  and 
100,  respectively.  The  regularization  paths,  data  generating  distribution,  and  averaged 
P-R  curves  are  plotted  in  Figure  8.  The  results  show  that  Flasso  clearly  suffers  from  the 
performance  degradation  compared  with  the  Gaussian  case,  perhaps  because  the  number 
of  samples  is  too  small  for  the  complicated  nonparanormal  distribution.  Due  to  the  two- 
step  estimation  scheme,  the  performance  of  Glasso  is  poor.  In  contrast,  KLIEP  separates 
changed  and  unchanged  edges  still  clearly  for  both  n  =  50  and  n  =  100.  The  P-R  curves 
also  show  the  same  tendency. 

3.4.3  “Diamond”  Distribution  with  No  Pearson  Correlation 

In  the  experiments  in  Section  3.4.2,  though  samples  are  non-Gaussian,  the  Pearson  cor¬ 
relation  is  not  zero.  Therefore,  methods  assuming  Gaussianity  can  still  capture  some 
linear  correlation  between  random  variables.  Here,  we  consider  a  more  challenging  case 
with  a  diamond-shaped  distribution  within  the  exponential  family  that  has  zero  Pearson 
correlation  between  variables.  Thus,  the  methods  assuming  Gaussianity  cannot  extract 
any  information  in  principle  from  this  dataset. 
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(g)  Gaussian  distribution 


(h)  P-R  curve,  n  =  100 


(i)  P-R  curve,  n  =  50 


Figure  7:  Experimental  results  on  the  Gaussian  dataset. 
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(a)  KLIEP,  n  =  100 


(b)  Flasso,  n  =  100 
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KLIEP 


(g)  Nonparanormal  distribu-  (h)  P-R  curve,  n  =  100 

tion 


0.8 

.1  0.6 
w 

CD 

0.4 

0.2 


\  'l' 


'l 


- KLIEP 

—  -Flasso 
-■"Glasso 


l  1 

k  \  \ 

r  1-  1*- 


i. 


i.  *'•«.  k 


0.2  0.4  0.6  0.8 

recall 


(i)  P-R  curve,  n  =  50 


Figure  8:  Experimental  results  on  the  nonparanormal  dataset. 
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The  probability  density  function  of  the  diamond  distribution  is  defined  as  follows 
(Figure  9(a)): 


p(x)  oc  exp 


E2*.2 

i— 1 


E 


20xfxj 


(25) 


where  the  adjacency  matrix  A  describes  the  MN  structure.  Note  that  this  distribution 
cannot  be  transformed  into  a  Gaussian  distribution  by  any  nonparanormal  transforma¬ 
tions. 

We  set  d  =  9  and  rip  =  riq  =  5000.  A1’  is  randomly  generated  with  35%  sparsity,  while 
A ®  is  created  by  randomly  removing  edges  in  Ap  so  that  the  sparsity  level  is  dropped  to 
15%.  Samples  from  the  above  distribution  are  drawn  by  using  a  slice  sampling  method 
[122],  Since  generating  samples  from  high-dimensional  distributions  is  non-trivial  and 
time-consuming,  we  focus  on  a  relatively  low- dimensional  case.  To  avoid  sampling  error 
which  may  mislead  the  experimental  evaluation,  we  also  increase  the  sample  size,  so  that 
the  erratic  points  generated  by  accident  will  not  affect  the  overall  population. 

In  this  experiment,  we  compare  the  performance  of  KLIEP,  Flasso,  and  Glasso  with 
the  Gaussian  model,  the  power  nonparanormal  model,  and  the  polynomial  model: 
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The  univariate  polynomial  transform  is  defined  as  f(xi,  Xi)  =  f(xi,  0).  We  test  k  —  2,  3, 4 
and  choose  the  best  one  in  terms  of  HOLL.  The  Flasso  and  Glasso  methods  for  the 
polynomial  model  are  computed  by  importance  sampling,  i.e.,  we  use  the  IS-Flasso  and 
IS-Glasso  methods  (see  Section  3.2.5).  Since  these  methods  are  computationally  very 
expensive,  we  only  test  k  =  4  which  we  found  to  be  a  reasonable  choice.  We  set  the 
instrumental  distribution  p'  as  the  standard  normal  M{ 0, 1),  and  use  sample  {cc'}J£°00  ~  p' 
for  approximating  integrals,  p'  is  purposely  chosen  so  that  it  has  a  similar  “bell”  shape 
to  the  target  densities  but  with  larger  variance  on  each  dimension. 

The  averaged  P-R  curves  over  20  datasets  are  shown  in  Figure  9(e).  KLIEP  with  the 
polynomial  model  significantly  outperforms  all  the  other  methods,  while  the  IS-Glasso  and 
especially  IS-Flasso  give  better  result  than  the  KLIEP,  Flasso,  and  Glasso  methods  with 
the  Gaussian  and  nonparanormal  models.  This  means  that  the  polynomial  basis  function 
is  indeed  helpful  in  handling  completely  non-Gaussian  data.  However,  as  discussed  in 
Section  3.2.2,  it  is  difficult  to  use  such  a  basis  function  in  Glasso  and  Flasso  because 
of  the  computational  intractability  of  the  normalization  term.  Although  IS-Glasso  can 
approximate  integrals,  the  result  shows  that  such  approximation  of  integrals  does  not  lead 
to  a  very  good  performance.  In  comparison,  the  result  of  the  IS-Flasso  method  is  much 
improved  thanks  to  the  coupled  sparsity  regularization,  but  it  is  still  not  comparable  to 
KLIEP. 

The  regularization  paths  of  KLIEP  with  the  polynomial  model  illustrated  in  Fig¬ 
ure  9(b)  show  the  usefulness  of  the  proposed  method  in  change  detection  under  non- 
Gaussianity.  We  also  give  regularization  paths  obtained  by  the  IS-Flasso  and  IS-Glasso 
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(a)  Diamond  distribution 


(b)  KLIEP 


(c)  IS-Flasso  (d)  IS-Glasso 
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(e)  P-R  curve 

Figure  9:  Experimental  results  on  the  diamond  dataset.  “NPN”  and  “POLY”  denote 
the  nonparanormal  and  polynomial  models,  respectively.  Note  that  the  precision  rate  of 
100%  recall  for  a  random  guess  is  approximately  20%. 
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d 

Figure  10:  Comparison  of  computation  time  for  solving  primal  and  dual  optimization 
problems. 

methods  on  the  same  dataset  in  Figures  9(c)  and  9(d),  respectively.  The  graphs  show  that 
both  methods  do  not  separate  changed  and  unchanged  edges  well,  though  the  IS-Flasso 
method  works  slightly  better. 

3.4.4  Computation  Time:  Dual  versus  Primal  Optimization  Problems 

Finally,  we  compare  the  computation  time  of  the  proposed  KLIEP  method  when  solving 
the  dual  optimization  problem  (23)  and  the  primal  optimization  problem  (22).  Both  the 
optimization  problems  are  solved  by  using  the  same  convex  optimizer  minFunc9.  The 
datasets  are  generated  from  two  Gaussian  distributions  constructed  in  the  same  way  as 
Section  3.4.1.  150  samples  are  separately  drawn  from  two  distributions  with  dimension 
d  =  40,  50,  60,  70,  80.  We  then  perform  change  detection  by  computing  the  regularization 
paths  using  20  choices  of  A2  ranging  from  10-4  to  10°  and  fix  Ai  =  0.1.  The  results  are 
plotted  in  Figure  10. 

It  can  be  seen  from  the  graph  that  as  the  dimensionality  increases,  the  computation 
time  for  solving  the  primal  optimization  problem  is  sharply  increased,  while  that  for  solv¬ 
ing  the  dual  optimization  problem  grows  only  moderately:  when  d  =  80,  the  computation 
time  for  obtaining  the  primal  solution  is  almost  10  times  more  than  that  required  for 
obtaining  the  dual  solution.  Thus,  the  dual  formulation  is  computationally  much  more 
efficient  than  the  primal  formulation. 

3.5  Applications 

In  this  section,  we  report  the  experimental  results  on  a  synthetic  gene  expression  dataset 
and  a  Twitter  dataset. 


9http : //www . di . ens . f r/~mschmidt/Sof tware/minFunc . html 
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3.5.1  Synthetic  Gene  Expression  Dataset 

A  gene  regulatory  network  encodes  interactions  between  DNA  segments.  However,  the 
way  genes  interact  may  change  due  to  environmental  or  biological  stimuli.  In  this  experi¬ 
ment,  we  focus  on  detecting  such  changes.  We  use  SynTReN,  which  is  a  generator  of  gene 
regulatory  networks  used  for  benchmark  validation  of  bioinformatics  algorithms  [192], 

We  first  choose  a  sub-network  containing  13  nodes  from  an  existing  signaling  network 
in  Saccharomyces  cerevisiae  (shown  in  Figure  11(a)).  Three  types  of  interactions  are 
modeled:  activation  (ac),  deactivation  (re),  and  dual  (du).  50  samples  are  generated  in 
the  first  stage,  after  which  we  change  the  types  of  interactions  in  6  edges,  and  generate 
50  samples  again.  Four  types  of  changes  are  considered:  ac  — >  re,  re  — >■  ac,  du  — >■  ac,  and 
du  — >  re. 

We  use  KLIEP  and  IS-Flasso  with  the  polynomial  transform  function  for  k  £  {2,  3,4}. 
The  regularization  parameter  Ai  in  KLIEP  and  Flasso  is  tested  with  choices  Ai  £ 
{0.1,1,10}.  We  set  the  instrumental  distribution  p'  as  the  standard  normal  A/"(0,  J), 
and  use  sample  {cc'}™00  ~  p'  for  approximating  integrals  in  IS-Flasso. 

The  regularization  paths  on  one  example  dataset  for  KLIEP,  IS-Flasso,  and  the  plain 
Flasso  with  the  Gaussian  model  are  plotted  in  Figures  11(b),  11(c),  and  11(d),  respec¬ 
tively.  Averaged  P-R  curves  over  20  simulation  runs  are  shown  in  Figure  11(e).  We  can  see 
clearly  from  the  KLIEP  regularization  paths  shown  in  Figure  11(b)  that  the  magnitude 
of  estimated  parameters  on  the  changed  pairwise  interactions  is  much  higher  than  that 
of  the  unchanged  edges.  IS-Flasso  also  achieves  rather  clear  separation  between  changed 
and  unchanged  interactions,  though  there  are  a  few  unchanged  interactions  drop  to  zero 
at  the  final  stage.  Flasso  gives  many  false  alarms  by  assigning  non-zero  values  to  the 
unchanged  edges,  even  after  some  changed  edges  hit  zeros. 

Reflecting  a  similar  pattern,  the  P-R  curves  plotted  in  Figure  11(e)  show  that  the 
proposed  KLIEP  method  has  the  best  performance  among  all  three  methods.  We  can 
also  see  that  the  IS-Flasso  method  achieves  significant  improvement  over  the  plain  Flasso 
method  with  the  Gaussian  model.  The  improvement  from  Flasso  to  IS-Flasso  shows 
that  the  use  of  the  polynomial  basis  is  useful  on  this  dataset,  and  the  improvement  from 
IS-Flasso  to  KLIEP  shows  that  the  direct  estimation  can  further  boost  the  performance. 

3.5.2  Twitter  Story  Telling 

Finally,  we  use  KLIEP  and  Flasso  as  event  detectors  from  Twitter.  More  specifically, 
we  choose  the  Deepwater  Horizon  oil  spill 10  as  the  target  event,  and  we  hope  that  our 
method  can  recover  some  story  lines  from  Twitter  as  the  news  events  develop.  Counting 
the  frequencies  of  10  keywords  (BP,  oil,  spill,  Mexico,  gulf,  coast,  Hayward,  Halliburton, 
Transocean,  and  Obama),  we  obtain  a  dataset  by  sampling  4  times  per  day  from  February 
1st,  2010  to  October  15th,  2010,  resulting  in  1061  data  samples. 

We  segment  the  data  into  two  parts:  the  first  300  samples  collected  before  the  day  of  oil 
spill  (April  20th,  2010)  are  regarded  as  conforming  to  a  10- dimensional  joint  distribution 


10littp : / / en. Wikipedia. org/wiki/Deepwater_Horizon_oil_spill 
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(a)  Gene  regulatory  network 


(b)  KLIEP 


(c)  IS-Flasso 


(d)  Flasso  (e)  P-R  curve 

Figure  11:  Experiments  on  synthetic  gene  expression  datasets. 
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Q,  while  the  second  set  of  samples  that  are  in  an  arbitrary  50-day  window  after  the 
oil  spill  accident  happened  is  regarded  as  following  distribution  P.  Thus,  the  MN  of  Q 
encodes  the  original  conditional  independence  of  frequencies  between  10  keywords,  while 
the  underlying  MN  of  P  has  changed  since  an  event  occurred.  We  expect  that  unveiling 
changes  in  MNs  between  P  and  Q  can  recover  the  drift  of  popular  topic  trends  on  Twitter 
in  terms  of  the  dependency  among  keywords. 

The  detected  change  graphs  (i.e.,  the  graphs  with  only  detected  changing  edges)  on 
10  keywords  are  illustrated  in  Figure  12.  The  edges  are  selected  at  a  certain  value  of 
A2  indicated  by  the  maximal  cross-validated  log-likelihood  (CVLL).  Since  the  edge  set 
that  is  picked  by  CVLL  may  not  be  sparse  in  general,  we  sparsify  the  graph  based  on 
the  permutation  test  as  follows:  we  randomly  shuffle  the  samples  between  P  and  Q  and 
repeatedly  run  change  detection  algorithms  for  100  times;  then  we  observe  detected  edges 
by  CVLL.  Finally,  we  select  the  edges  that  are  detected  using  the  original  non-shuffled 
dataset  and  remove  those  that  were  detected  in  the  shuffled  datasets  for  more  than  5 
times  (i.e.,  the  significance  level  5%).  I11  Figure  12,  we  plot  detected  change  graphs  which 
are  generated  using  samples  of  P  starting  from  April  17th,  July  6th,  and  July  26th, 
respectively. 

The  initial  explosion  happened  on  April  20th,  2010.  Both  methods  discover  depen¬ 
dency  changes  between  keywords.  Generally  speaking,  KLIEP  captures  more  conditional 
independence  changes  between  keywords  than  the  Flasso  method,  especially  when  com¬ 
paring  Figure  12(c)  and  Figure  12(f).  At  the  first  two  stages  (Figures  12(a),  12(b),  12(d) 
and  12(e)),  the  keyword  “Obama”  is  very  well  connected  with  other  keywords  in  the  re¬ 
sults  given  by  both  methods.  Indeed,  at  the  early  development  of  this  event,  he  lies  in  the 
center  of  the  news  stories,  and  his  media  exposure  peaks  after  his  visit  to  the  Louisiana 
coast  (May  2nd,  May  28nd,  and  June  5th)  and  his  meeting  with  BP  CEO  Tony  Hay¬ 
ward  on  June  16th.  Notably,  both  methods  highlight  the  “gulf-obama-coast”  triangle  in 
Figures  12(a)  and  12(d)  and  the  “bp-obama-hayward”  chain  in  Figures  12(b)  and  12(e). 

However,  there  are  some  important  differences  worth  mentioning.  First,  the  Flasso 
method  misses  the  “transocean-hayward-obama”  triangle  in  Figures  12(d)  and  12(e). 
Transocean  is  the  contracted  operator  in  the  Deepwater  Horizon  platform,  where  the 
initial  explosion  happened.  On  Figure  12(c),  the  chain  “bp-spill-oil”  may  indicate  that 
the  phrase  “bp  spill”  or  “oil  spill”  has  been  publicly  recognized  by  the  Twitter  community 
since  then,  while  the  “hayward-bp-mexico”  triangle,  although  relatively  weak,  may  link 
to  the  event  that  Hayward  stepped  down  from  the  CEO  position  on  July  27th. 

It  is  also  noted  that  Flasso  cannot  find  any  changed  edges  in  Figure  12(f),  perhaps 
due  to  the  Gaussian  restriction. 

3.6  Discussion,  Conclusion,  and  Future  Works 

In  this  paper,  we  proposed  a  direct  approach  to  learning  sparse  changes  in  MNs  by  den¬ 
sity  ratio  estimation.  Rather  than  fitting  two  MNs  separately  to  data  and  comparing 
them  to  detect  a  change,  we  estimated  the  ratio  of  the  probability  densities  of  two  MNs 
where  changes  can  be  naturally  encoded  as  sparsity  patterns  in  estimated  parameters. 
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(d)  April  17th  June  5th,  (e)  June  6th-July  25th, 

Flasso  Flasso 


(f)  July  26th-Sept. 
14th,  Flasso 


Figure  12:  Change  graphs  captured  by  the  proposed  KLIEP  method  (top)  and  the  Flasso 
method  (bottom).  The  date  range  beneath  each  figure  indicates  when  P  was  sampled, 
while  Q  is  fixed  to  dates  from  February  1st  to  April  20th.  Notable  structures  shared  by 
the  graph  of  both  methods  are  surrounded  by  the  dash-dotted  lines.  Unique  structures 
that  only  appear  in  the  graph  of  the  proposed  KLIEP  method  are  surrounded  by  the 
dashed  lines. 


This  direct  modeling  allows  us  to  halve  the  number  of  parameters  and  approximate  the 
normalization  term  in  the  density  ratio  model  by  a  sample  average  without  sampling.  We 
also  showed  that  the  number  of  parameters  to  be  optimized  can  be  further  reduced  with 
the  dual  formulation,  which  is  highly  useful  when  the  dimensionality  is  high.  Through 
experiments  on  artificial  and  real-world  datasets,  we  demonstrated  the  usefulness  of  the 
proposed  method  over  state-of-the-art  methods  including  nonparanormal-based  methods 
and  sampling-based  methods. 

Our  important  future  work  is  to  theoretically  elucidate  the  advantage  of  the  proposed 
method,  beyond  the  Vapnik’s  principle  of  solving  the  target  problem  directly.  The  rela¬ 
tion  to  score  matching  [75],  which  avoids  computing  the  normalization  term  in  density 
estimation,  is  also  an  interesting  issue  to  be  further  investigated.  Considering  higher- 
order  MN  models  such  as  the  hierarchical  log-linear  model  [144]  is  a  promising  direction 
for  extension. 

In  the  context  of  change  detection,  we  are  mainly  interested  in  the  situation  where 
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p  and  q  are  close  to  each  other  (if  p  and  q  are  completely  different,  it  is  straightforward 
to  detect  changes).  When  p  and  q  are  similar,  density  ratio  estimation  for  p(x)/q(x) 
or  q{x)/p{x)  perform  similarly.  However,  given  the  asymmetry  of  density  ratios,  the 
solutions  for  p{x)/q{x)  or  q{x)/p{x )  are  generally  different.  The  choice  of  the  numerator 
and  denominator  in  the  ratio  is  left  for  future  investigation. 

Detecting  changes  in  MNs  is  the  main  target  of  this  paper.  On  the  other  hand, 
estimating  the  difference/divergence  between  two  probability  distributions  has  been  stud¬ 
ied  under  a  more  general  context  in  the  statistics  and  machine  learning  communities 
[8,  49,  200,  171,  161].  In  fact,  the  estimation  of  the  Kullback-Leibler  divergence  [103] 
is  related  to  the  KLIEP-type  density  ratio  estimation  method  [125],  and  the  estimation 
of  the  Pearson  divergence  [129]  is  related  to  the  squared-loss  density  ratio  estimation 
method  [83].  However,  the  density  ratio  based  divergences  tend  to  be  sensitive  to  out¬ 
liers.  To  overcome  this  problem,  a  divergence  measure  based  on  relative  density  ratios 
was  introduced,  and  its  direct  estimation  method  was  developed  [212],  L2-distance  is 
another  popular  difference  measure  between  probability  density  functions.  L2-distance  is 
symmetric,  unlike  the  Kullback-Leibler  divergence  and  the  Pearson  divergence,  and  its 
direct  estimation  method  has  been  investigated  recently  [172,  96]. 

Change  detection  in  time-series  a  related  topic.  A  straightforward  approach  is  to 
evaluate  the  difference  (dissimilarity)  between  two  consecutive  segments  of  time-series 
signals.  Various  methods  have  been  developed  to  identify  the  difference  by  fitting  two 
models  to  two  segments  of  time-series  separately,  e.g.,  the  singular  spectrum  transform 
[120,  76],  subspace  identification  [94],  and  the  method  based  on  the  one-class  support 
vector  machine  [42],  In  the  same  way  as  the  current  paper,  directly  modeling  of  the 
change  has  also  been  explored  for  change  detection  in  time-series  [93,  113,  172], 

4  Learning  under  Non-Stationarity 

4.1  Background 

The  goal  of  supervised  learning  such  as  regression  and  classification  is  to  learn  an  input- 
output  dependency  from  input-output  paired  training  samples  so  that  test  output  y'  for 
unseen  test  input  x'  can  be  accurately  estimated.  Various  supervised  learning  algorithms 
were  developed  thus  far,  and  they  have  been  demonstrated  to  be  useful  in  a  wide  range  of 
applications.  Most  of  the  popular  machine  learning  algorithms  assume  that  training  and 
test  data  follow  the  same  probability  distribution,  based  on  which  learning  machines  can 
generalize  to  unseen  test  data  from  training  data  [194,  71,  24],  However,  this  fundamental 
assumption  is  often  violated  in  practice,  and  this  causes  standard  supervised  learning 
algorithms  suffer  significant  estimation  bias. 

In  this  article,  we  consider  two  scenarios.  The  Erst  setup  is  the  covariate  shift  [150, 
158],  where  training  and  test  input  data  follow  different  distributions  but  the  input-output 
relation  does  not  change  between  training  and  test  phases.  The  other  setup  is  called  class- 
balance  change  in  classification  [142,  45], (where  the  class-prior  probabilities  are  different 
in  training  and  test  phases  but  the  input  distribution  of  each  class  does  not  change.  For 
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these  two  scenarios,  we  review  semi-supervised  adaptation  techniques,  where  importance 
weighting  plays  an  essential  role. 

More  specifically,  we  consider  the  semi-supervised  learning  problem  where  input- 
output  training  samples  {(ccj, y,)}”=1  and  input-only  test  samples  {cc',}”=1  are  available. 
In  the  standard  semi-supervised  learning  setup,  training  and  test  samples  are  regarded 
as  being  drawn  from  the  same  probability  distribution  [31].  In  contrast,  in  this  article, 
we  suppose  that  they  are  drawn  from  different  distributions:  { (as*,  2/i)}"=1  are  drawn  in¬ 
dependently  from  a  joint  probability  distribution  with  density  p(x,y )  and  {cc',}™=1  are 
drawn  independently  from  a  marginal  probability  distribution  with  density  f  p'(x,y)dy, 
where  p(x,y)  and  p’(x,  y )  are  different: 

P(x,y)  7^  P\x,y ). 

Our  goal  is  to  learn  the  input-output  relation  for  test  samples.  The  situation  where  train¬ 
ing  and  test  samples  follow  different  distributions  is  also  referred  to  as  non-stationarity 
adaptation ,  dataset-shift  adaptation,  transfer  learning ,  and  domain  adaptation.  The  semi- 
supervised  learning  setup  with  differing  training  and  testing  distributions  is  sometimes 
called  unsupervised  transfer  or  unsupervised  adaptation  in  literature  because  no  supervi¬ 
sion  is  available  from  the  test  domain. 

4.2  Adaptation  Techniques  for  Covariate  Shift 

The  covariate  shift  [150,  158]  is  the  situation  where  input  distributions  change  but  the 
conditional  distribution  of  outputs  given  inputs  remains  unchanged: 

p(x)^p'(x)  and  p(y\x)  —  p'(y\x). 

Figure  13  illustrates  an  example  of  covariate  shift  regression:  Training  input  samples 
{xi}\ Li  are  drawn  from  the  left-hand  side  of  the  domain,  whereas  test  input  samples 
{ccj,  }”=1  are  drawn  from  the  right-hand  side.  This  problem  is  similar  to  extrapolation 
since  the  prediction  is  made  in  a  low  density  region  of  the  training  set. 

4.2.1  Importance- Weighted  Learning 

For  this  covariate-shift  regression  problem,  let  us  use  a  simple  linear  model, 

M(x)  =  9i  +  02x, 

and  train  this  model  by  ordinary  least-squares: 

n  2 

T^(%)-K)  . 

1=1 

The  learned  result  illustrated  in  Figure  14(a)  shows  that  the  obtained  function  fits  the 
training  samples  {(a;*,  2/®)}it=1  very  well,  but  it  does  not  give  good  prediction  of  outputs 
for  the  test  input  samples  {cc',}”=1  (i.e.,  samples  denoted  by  “x”). 
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(a)  Input  densities  and  impor¬ 
tance 


0  12  3 


x 

(b)  Learning  target  function, 
training  samples,  and  test  sam¬ 
ples 


Figure  13:  Covariate  shift.  Input  distributions  change  but  the  conditional  distribution  of 
outputs  given  inputs  does  not  changed. 


(a)  Ordinary  least-squares 


(b)  Importance-weighted  least- 
squares 


Figure  14:  Regression  under  covariate  shift.  Dashed  lines  denote  learned  functions. 


Under  the  covariate  shift,  it  is  expected  that  only  training  samples  whose  input  points 
are  close  to  test  input  samples  {aj',}"=1  are  useful.  This  intuitive  idea  can  be  realized  by 
weighting  the  training  loss  according  to  the  importance ,  which  is  the  ratio  between  p'(x ) 
and  p(x). 


w(x) 


p\x) 
P(x)  ' 


In  Figure  14(b),  the  learned  result  obtained  by  importance-weighted,  least-squares  [150], 


min 

e 


lb  ^ 

^w{xi)(M{xi)  -  y^j  , 
2=1 
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is  illustrated.  This  shows  that  importance  weighting  can  improve  the  accuracy  of  predict¬ 
ing  outputs  for  the  test  input  samples  {cc',}”=1. 

The  above  importance-weighted  least-squares  can  be  regarded  as  an  application  of 
importance  weighting  to  approximating  the  generalization  error  (or  the  expected  test 
loss) : 


G  := 


loss (y,  M(x))p\x ,  y)dxdy, 


where  loss(y,  y)  denotes  a  point-wise  loss  when  y  is  predicted  by  y.  More  specifically,  the 
generalization  error  G  can  be  approximated  by  the  importance-weighted  average  of  the 
training  loss: 


G  = 


n 


\oss(y,  M  (x))p'  (y\x)p'  (x)dxdy 
p'  (xj 

loss (y,  M(x))p'(y\x )  .  p(x)dxdy 

pyx) 

loss (y,  M(x))w(x)p(x,  y)dxdy 


y  loss (yi,  M(Xi))w(Xi). 

i= 1 


Note  that  this  importance  weighting  idea  can  be  applied  to  any  likelihood/loss-based 
learning  algorithms,  including  Fisher  discriminant  analysis,  logistic  regression,  the  support 
vector  machine,  boosting,  and  the  conditional  random  field,  and  it  also  plays  an  important 
role  for  reducing  the  estimation  bias  in  active  learning  and  experimental  design  scenarios 
[202,  84,  155,  81,  165,  163].  See  [158]  for  more  thorough  discussion  on  importance-weighted 
learning. 

To  implement  importance-weighted  learning,  importance  values  {w(Xi)}^=1  are  neces¬ 
sary.  However,  training  and  test  input  densities  p(x)  and  p'(x)  are  unknown  in  practice, 
and  thus  the  importance  values  should  be  estimated  from  data.  A  naive  approach  is  to 
estimate  p(x)  from  {cc,;}”=1  and  p'(x)  from  {cc(,  }"=1  separately  and  then  take  their  ratio. 
However,  such  a  two-step  procedure  is  not  accurate  because  the  error  incurred  in  the 
estimation  of  p(x)  and  p'( x)  can  be  increased  when  their  ratio  is  computed  in  the  second 
stage.  Thus,  directly  estimating  the  ratio  w(x)  without  estimating  p(x)  and  p'( x)  is  more 
preferable. 

Following  this  idea,  various  methods  of  importance  estimation  have  been  developed, 
for  example,  based  on  density  estimation  of  p'(x)  after  uniformization  of  p(x)  [40,  32], 
logistic  regression  for  discriminating  data  from  p(x)  and  p'( x)  [131,  33,  20],  moment 
matching  between  p'(x)  and  p(x)w(x)  [131,  62,  87],  integral  equations  between  p'(x)  and 
p(x)w(x)  [196,  132],  density  matching  between  p'(x)  and  p(x)w(x)  under  the  Kullback- 
Leibler  divergence  [173,  124,  187,  206,  211],  least-squares  importance  fitting  of  w(x)  to 
j/(x)/p(x)  [82,  87],  and  importance  fitting  of  w(x)  to  p'(x)/p(x)  under  the  Bregman 
divergence  [170]. 


41 


Among  them,  the  least-squares  importance  fitting  method  has  various  practical  advan¬ 
tages,  for  example,  an  analytic-form  solution  that  can  be  computed  efficiently  is  available, 
cross-validation  is  available  for  hyperparameter  tuning,  the  optimal  convergence  rate  is 
achieved  both  in  parametric  and  non-parametric  settings  [82,  87],  and  the  highest  numeri¬ 
cal  stability  in  terms  of  condition  numbers  is  achieved  among  a  class  of  importance  estima¬ 
tors  [88].  Furthermore,  dimensionality  reduction  methods  for  improving  the  accuracy  of 
importance  estimation  in  high-dimensional  problems  have  been  developed  [159,  176,  209]. 
See  [168]  for  more  comprehensive  discussion  on  direct  importance  estimation. 


4.2.2  Relative  Importance- Weighted  Learning 


Let  us  continue  using  the  illustrative  example  described  in  Figure  13  and  Figure  14.  The 
true  importance  function  w(x)  is  plotted  in  Figure  13(a).  This  shows  that,  among  many 
training  samples,  only  a  small  number  of  samples  at  around  x  =  2  have  large  impor¬ 
tance  weights  and  other  samples  have  almost  zero  weights.  This  implies  that  importance- 
weighted  learning  in  this  example  is  rather  unreliable  because  the  learned  function  is 
essentially  obtained  from  only  a  few  training  samples. 

Such  unreliable  behavior  is  caused  by  the  fact  that  the  importance  function  w{x)  can 
take  very  large  values.  To  cope  with  this  problem,  the  relative  importance  weight  is  useful 
[212]: 

WW(X)  = _ kM _ 

^  1  pp'(x)  +  (i-i3)p(xy 


where  /3  G  [0,1]  is  the  relativity  parameter.  The  relative  importance  weight  w^\x)  is 
reduced  to  the  ordinary  importance  weight  w(x )  when  (3  =  0.  As  /3  is  increased,  the 
relative  importance  weight  gets  flatter  and  is  reduced  to  the  uniform  weight  x)  =  1 
when  (3=1  (Figure  15).  The  non-negativity  of  the  importance  function,  p'(x)/p(x)  >  0, 
assures  that  the  relative  importance  weight  is  bounded  from  above  by  1//3: 


w 


W(x)  = 


£+(i -em 


i 

— —  <  -. 
p(x)  —  p 


The  least-squares  method  combined  with  the  relative  importance  weight  is  called  relative 
importance-weighted  least-squares: 


min  -  ^  w^\xi)  (M{Xi)  -  y^J  , 

Z— 1 


where  the  relativity  parameter  /3  controls  the  trade-off  between  bias  and  variance. 

Now  let  us  consider  the  problem  of  estimating  the  relative  importance  weight  w^\x) 
from  {xi\™=l  and  {cc',}"=1.  We  use  the  following  linear-in-parameter  model  wa(x)  for 
learning  the  relative  importance  weight  w^\x): 


b 

wa(x)  =  ^otjipjix)  =  aJ'ipix), 

3  = 1 
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(a)  Probability  densities 


(b)  Relative  importance  w G)  ( x ) 


Figure  15:  Relative  importance.  p'(x)  is  the  normal  distribution  with  mean  0  and  variance 
1,  and  p(x)  is  the  normal  distribution  with  mean  0.5  and  variance  1. 


where  a:  =  (an, . . . ,  ab)T  is  the  parameter  vector  and  t/?(cc)  =  (ijji(x), . . . ,  -0fe(^))T  is  the 
basis  function  vector.  As  basis  functions,  we  may  use,  for  example,  the  Gaussian  kernels: 


wa(x)  =  E  aj  exp 

3  =  1 

where  a2  denotes  the  Gaussian  width. 


\x  —  x 


/  112 


2a2 


Then  the  parameter  ol  is  learned  so  that  the  following  criterion  J(ct)  is  minimized: 
J(a)  =  j  (wa(x)  —  w^\x)Sj  (j3p'(x)  +  (1  —  /3)p(x)^dx 

=  j  ctT'il>(x)'il>(x)T  ct(^f3p'(x)  +  (1  —  P)p{x)^  dx 
—  2  /  aT^3(x)p'(x)dx  +  C, 


where  the  third  term, 


C 


is  a  constant  irrelevant  to  the  parameter  ot  and  thus  can  be  ignored.  Approximating 
the  expectations  in  the  first  and  second  terms  by  sample  averages  and  adding  the  i2- 
regularizer,  we  have  the  following  training  criterion: 


min 

OL 


olt  G/3(X  —  2  ctTh  +  A  ||  ck  || 2 


1 


where  Gp  and  h,  a  6  x  6  matrix  and  a  6-dimensional  vector,  are  defined  as 


_  a  n  1-/1  ^ 

Gp  =  —  and  h 

n  z — '  n 


1 

n’ 


i'= 1 
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(a)  Training  and  test  data  (b)  Relative  importance  weight 

(P  =  0-5) 


Figure  16:  Illustration  of  RuLSIF.  “x”  in  Figure  16(b)  denotes  an  estimated  relative 
importance  value  at  xt. 


This  training  criterion  is  a  convex  quadratic  function  of  ct  and  its  minimize!’  6t  can  be 
obtained  analytically  as 


a  =  (gi3  +  A  j)  1  h. 

This  method  is  called  relative  unconstrained  least-squares  importance  fitting  (RuLSIF) 
[212].  Tuning  parameters  such  as  the  regularization  parameter  A  and  the  Gaussian  width 
a2  can  be  optimized  via  cross-validation  with  respect  to  J. 

An  example  of  relative  importance  estimation  by  RuLSIF  is  illustrated  in  Figure  16. 

4.2.3  Importance- Weighted  Model  Selection 

Choice  of  the  relativity  parameter  /3  as  well  as  other  tuning  parameters  such  as  basis 
functions  and  regularization  parameters  is  crucial  for  obtaining  better  performance  in 
practice.  For  model  selection,  various  methods  such  as  the  Akaike  information  crite¬ 
rion  [3],  the  subspace  information  criterion  [164],  and  cross-validation  [154]  are  available. 
However,  under  the  covariate  shift,  these  model  selection  techniques  based  on  training 
samples  {(as*,  2/*)}it=x  do  no^  &iye  V£did  evaluation  of  the  prediction  accuracy  of  outputs 
for  test  inputs  {cc',}”=1. 

Under  the  covariate  shift,  importance-weighted  variants  of  such  model  selection  meth¬ 
ods  are  useful  [150,  162,  160].  The  simplest  model  selection  method  called  importance- 
weighted  cross-validation  is  given  as  follows: 

1.  Randomly  split  training  samples  T  =  {(cc*,  ?/i)}f=1  into  m  disjoint  subsets  {Ti}7fk1 
of  (approximately)  the  same  size. 

2.  Repeat  for  i  —  1, . . . ,  m; 
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(a)  Obtain  a  learned  function  ft  from  T\%  (i.e.,  all  samples  without  7[j. 

(b)  Evaluate  the  generalization  error  using  hold-out  samples  71  as 

2 


Gi  =  ( 


( Vr\  ^  w(x)(fi(x)-y 


(Regression), 


1 

ill 


E 


w(x 


1  —  sign  ( fi(x)  y)j  (Classification), 


(■ x,y)eTi 

where  |7*|  denotes  the  number  of  elements  in  the  set  7j. 

3.  Output  the  average  of  ,  Grn  as  the  final  evaluation  G  of  the  generalization 

error: 


d  =  ~ydi. 

m  G= 


1=1 


4.2.4  Applications 

Importance-weighted  learning  has  been  successfully  applied  to  various  real-world  prob¬ 
lems,  including  brain-computer  interface  [160,  107],  robot  control  [64,  4,  65,  221],  speaker 
identification  [210],  age  prediction  from  face  images  [190],  activity  recognition  from  ac¬ 
celerometers  [66],  natural  language  processing  [187],  spam  filtering  [22],  targeted  adver¬ 
tising  [21],  HIV  therapy  screening  [19],  and  wafer  alignment  in  semiconductor  exposure 
apparatus  [163].  Below,  we  describe  application  of  covariate  shift  adaptation  in  3D  human- 
pose  estimation  from  monocular  videos  [205]. 

We  use  the  HumanEva-I  dataset  [151],  which  contains  synchronized  multi- view  videos 
and  motion-capture  data  for  3  subjects  performing  multiple  activities:  Walking,  jogging, 
boxing,  throwing  and  catching,  and  gesturing.  As  input  x ,  we  extract  the  histogram- 
of- oriented- gradient  (HoG)  feature  [26]  of  270  dimensions  from  videos  taken  by  3  color 
cameras  with  9630  image-pose  frames  for  each  camera.  Output  y  is  a  corresponding 
pose  vector,  which  means  that  we  consider  a  multi-dimensional  regression  problem.  We 
randomly  select  n  samples  from  the  set  of  3  x  4815  =  14445  frames  for  training  and  use 
the  remaining  14445  frames  for  testing. 

We  consider  the  following  scenarios: 

Selection  bias:  The  training  set  contains  data  from  all  3  subjects,  whereas  the  test  set 
only  contains  data  from  a  single  subject. 

Subject  transfer:  The  training  set  contains  data  from  2  subjects,  whereas  the  test  set 
contains  data  from  the  remaining  subject  not  included  in  the  training  set. 

As  regression  algorithms,  we  use  kernel  regression  (KR)  [2],  twin  Gaussian  processes 
regression  (TGP)  [26],  and  the  weighted  k-nearest  neighbor  (WkNN)  method  [146].  See 
[205]  for  the  details  of  these  algorithms.  For  KR  and  TGP,  we  consider  their  importance- 
weighted  variants  which  are  referred  to  as  IWKR  and  IWTGP. 
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Each  pose  is  represented  by  20  3D-joint  markers:  y  =  [f/1)1", . . . ,  ?/20)T]T  e  M60,  where 
y(m>  e  M3  for  m  —  1, . . . ,  20.  Error  between  true  pose  y*  and  its  estimate  y  is  measured 
by  the  average  Euclidean  distance: 


1 

Error (t/*,  y)  =  —  ^  || y(m)  -  y*[m)  ||. 

m=l 

Figure  17  shows  the  pose  estimation  error  as  a  function  of  the  training  sample  size  n 
averaged  over  all  motions  and  10  runs.  The  graphs  clearly  show  that  IWTGP  and  IWKR 
outperform  their  non-adaptive  counterparts  and  the  baseline  WkNN  method. 


4.3  Adaptation  Techniques  for  Class-Balance  Change 

Class-balance  change  [142,  45]  is  the  classification  problem  where  class-prior  probabilities 
change  but  the  conditional  distribution  of  input  x  given  class  y  remains  unchanged: 

p(y)¥zP'(y)  and  p(x\y)  =  p'(x\y).  (26) 


Figure  18  illustrates  an  example  of  classification  under  class-balance  change.  When  the 
class  balances  are  different  in  the  training  and  test  phases,  naive  training  of  a  classifier 
yields  significant  estimation  bias  even  if  the  class-conditional  input  density  is  unchanged. 

In  the  same  way  as  covariate  shift  adaptation,  estimation  bias  caused  by  class-balance 
change  can  be  canceled  by  weighting  the  training  loss  according  to  the  class-balance  ratio: 


w(y) 


p'(y ) 
p(y)  ’ 


Below,  we  focus  on  binary  classification  where  label  y  takes  either  +1  or  —1  for  sim¬ 
plicity. 


4.3.1  Class-Balance  Estimation 

The  training  class-balance  p(y)  can  be  naively  estimated  by  ny/n  if  ny  samples  belong  to 
class  y  in  the  training  set  {(a?*,  yi)}™=1.  The  test  class-balance  p'(y )  can  also  be  estimated 
in  the  same  way  if  a  labeled  test  set  {(cc',,  y',)}”= 1  is  available.  However,  we  are  considering 
a  semi-supervised  learning  setup  where  only  an  unlabeled  test  set  {:e',}™=1  is  available. 
Thus,  p'(y )  cannot  be  estimated  naively. 

In  the  semi-supervised  learning  setup  under  Eq.(26),  p\y)  can  be  estimated  by  fitting 
a  mixture  qn(x)  of  training  class-wise  densities  p(x\y)  to  test  input  density  p'(x)  (see 
Figure  19): 


qn(x)  =  7ip(x\y  =  +1)  +  (1  -  ir)p(x\y  =  -1). 

The  value  of  the  parameter  n  corresponds  to  p\y  =  +1),  whereas  1  —  n  corresponds  to 

p'(y  =  — !)• 
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(a)  Selection  Bias,  SI 


(b)  Subject  Transfer,  SI 


(c)  Selection  Bias,  S2  (d)  Subject  Transfer,  S2 


(e)  Selection  Bias,  S3 


(f)  Subject  Transfer,  S3 


Figure  17:  3D  human-pose  estimation  error  as  a  function  of  the  number  of  training 
samples  averaged  over  all  motions  for  each  subject.  The  best  method  and  comparable 
ones  in  terms  of  the  average  error  according  to  the  paired  t-test  at  the  significance  level 
5%  are  specified  by  ‘o’. 
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(a)  Training  data 


(b)  Test  data 


Figure  18:  Change  in  class  balances  shifts  the  optimal  classification  boundary.  Class- 
conditional  input  density  is  the  same  between  the  training  and  test  phases  (i.e.,  p[x\y)  = 
p'(x\y)),  but  class-prior  probabilities  are  different  (i.e.,  p(y)  ^  p'(y )). 


Figure  19:  p'(y)  can  be  estimated  by  fitting  a  mixture  of  training  class-wise  densities 
p(x\y)  to  test  input  density  p'(x). 

For  the  fitting  of  q7 r  to  p',  we  may  use  the  Kullback-Leibler  ( KL )  divergence  [103]  or 
the  Pearson  (PE)  divergence  [128]: 

KL(p'\\qn)=  f  p'{x)  log  ^-4— yda;, 

J  q-iryX) 

PE(p'\\qn)  =  j  qn(x)  ~  d*- 

These  divergences  can  be  accurately  approximated  from  samples  by  directly  estimating  the 
density  ratio  p'(x)/qn(x)  without  density  estimation  of  p'(x)  and  qn(x)  [168].  However, 
the  density  ratio  function  p'(x)/qn(x)  is  sensitive  to  small  variation,  and  therefore  it  is 
not  robust  against  outliers. 

Here  we  consider  the  L2 -distance  between  p'  and  qy: 

L2(p',qn)  =  j  (p\x)  -  qn{x)^  dx. 
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The  L2-distance  can  also  be  accurately  approximated  from  samples  by  directly  estimating 
the  density  difference  p'(x )  —  g, r(x),  without  density  estimation  of  p'(x)  and  qn(x)  [172], 
Historically,  non-parametric  estimation  of  mixture  proportion  n  under  the  L2-distance 
was  first  investigated  in  [67],  which  uses  empirical  distribution  functions.  Following  this 
seminal  work,  its  variant  based  on  kernel  density  estimation  has  been  developed  [184], 
and  this  is  further  extended  to  choosing  the  kernel  bandwidths  jointly  [68].  In  the  related 
context  of  two-sample  homogeneity  testing  under  the  L2-distance,  the  use  of  kernel  density 
estimators  with  fixed  and  equal  bandwidths  has  been  investigated  [9]. 


4.3.2  L2-Distance  Approximation 

Here,  we  explain  how  the  L2-distance  can  be  directly  approximated  from  data  via  direct 
density-difference  estimation  [96,  172],  For  simplicity,  we  consider  the  approximation 
problem  of  the  L2-distance  between  p  and  p', 

L2(p,p')  =  j  f(x)2dx,  where  f(x)  =  p(x)  —  p'(x),  (27) 

from  {££*}”=!  and  {cc',}”'=1. 

We  use  the  following  Gaussian  density-difference  model: 


where 


(Ci,  .  .  .  ,  Cn  ,  .  .  .  ,  Cn_ Tn' )  (®lj 


,x„,x 


ni  •t'l) 


,  X, 


are  Gaussian  centers.  The  parameter  a  =  (cci, . . . ,  on+n/)T  in  the  density-difference  model 
is  learned  so  that  the  following  criterion  J(cx)  is  minimized: 


J(a)  =  f  (g(x)  -  fix)')  dx 

=  [  gixfdx-2  [  g(x)f(x)dx  +  C, 


where  the  third  term, 

C  =  I  f{x)2dx, 


is  a  constant  irrelevant  to  the  parameter  a.  and  thus  can  be  ignored.  The  first  term  can 
be  computed  analytically  as 


j  g(x)2 dx  =  a.T U ck, 
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(a)  Data  (b)  Density  difference 


Figure  20:  Illustration  of  LSDD.  “x”  in  Figure  20(b)  denotes  an  estimated  density  dif¬ 
ference  value  at  £Ej  and  x\,. 


where  U  is  the  (n  +  n')  x  (n  +  n ')  matrix  with  the  (j,  j')-th  element  defined  by 


uj,f  =  I  exP 
=  (tt  a2)d/2 


\X  —  Ca 


2a2 


exp 


\X  —  Cj'\ 


2a2 


dx 


exp 


I  cj  cj'  I 


4a2 


Approximating  the  expectations  in  the  second  term  by  sample  averages  and  adding  the 
^2-regularizer,  we  have  the  following  training  criterion: 


min 

OL 


a  1  U  (x  —  2  ctTv  +  A  ||  ok  || 2 


where  v  is  the  (n  +  n')-dimensional  vector  with  the  j-tli  element  defined  by 


1=1 


1 

n' 


J^exp 

i'  =  l 


This  training  criterion  is  a  convex  quadratic  function  of  ol  and  its  minimize!’  ot  can  be 
obtained  analytically  as 

a  =  (U  +  A/)-1  v. 


This  method  is  called  the  least-squares  density- difference  (LSDD)  estimator  [172],  Tuning 
parameters  such  as  the  regularization  parameter  A  and  basis  function  ij)  can  be  optimized 
via  cross-validation  with  respect  to  J.  An  example  of  density-difference  estimation  by 
LSDD  is  illustrated  in  Figure  20. 

If  the  true  density-difference  /  in  Eq.(27)  is  replaced  with  the  LSDD  estimator,  we 
obtain  the  following  L2-distance  estimator: 

ctTUa. 
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Similarly,  from  another  expression  of  the  L2-distance  estimator, 

L2(p,p')  =  I  f(x)(p(x)  -p\x)^j dx, 

we  obtain  the  following  L2-distance  estimator: 

vT  at. 

It  was  shown  that  the  linear  combination  of  these  estimators, 

2  vT  a  —  ctTUa, 

tends  to  have  smaller  bias  [172],  and  thus  this  would  be  a  more  reliable  L2-distance 
estimator  in  practice. 

4.3.3  Experiments 

Here,  we  use  four  UCI  benchmark  datasets 11  for  experiments,  where  we  randomly  choose 
10  labeled  training  samples  from  each  class  and  50  unlabeled  test  samples  following  true 
class-prior: 

7T*  =  0.1,  0.2,...,  0.9. 

The  LSDD  method  is  compared  with  the  following  methods: 

KDEi:  Kernel  density  estimation  (KDE)  is  used  to  approximate  p'(x)  and  q7r( x)  from 
data  and  then  the  L2-distance  is  computed  [184],  Two  Gaussian  widths  are  inde¬ 
pendently  chosen  based  on  5- fold  least-squares  cross-validation  [69]. 

KDEj  In  the  KDE-based  method,  two  Gaussian  widths  are  jointly  chosen  based  on  5-fold 
cross-validation  in  terms  of  the  LSDD  criterion  [68].  That  is,  the  cross- validated 
LSDD  criterion  is  computed  as  a  function  of  two  Gaussian  widths  and  the  best  pair 
that  minimizes  the  criterion  is  selected. 

EM:  The  class-prior  estimation  method  based  on  the  expectation-maximization  algo¬ 
rithm  [142],  This  method  actually  corresponds  to  distribution  matching  under  the 
KL  divergence. 

The  left  graphs  in  Figure  21  plot  the  mean  and  standard  error  of  the  squared  difference 
between  true  and  estimated  class-balances  n  over  1000  runs.  These  graphs  show  that 
LSDD  tends  to  provide  better  class-balance  estimates  than  alternative  approaches. 

Next,  we  use  the  estimated  class  balance  to  train  a  classifier.  We  use  a  weighted 
.^-regularized  least-squares  classifier  [138].  That  is,  a  class  label  y  for  a  test  input  x  is 
estimated  by 

V  =  sign  0eK(x,  , 

uhttp : / / archive . ics .uci . edu/ml/ 
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(a)  Australian  dataset 


(b)  Diabetes  dataset 


(c)  German  dataset 
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(d)  Statlogheart  dataset 


Figure  21:  Results  of  class-balance  adaptation.  Left:  Squared  error  of  class-balance  esti¬ 
mation.  Right:  Misclassification  error  by  a  weighted  ^-regularized  least-squares  classifier 
with  weighted  cross-validation. 
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where  K(x,x')  is  the  Gaussian  kernel  function  with  kernel  width  k.  are  learned 

parameters  given  by 


argmin 

8i,...,6n 


Kyi 

Zt  nVi/n 


9iK(xi,  xi)  -  yi 


+ ^  y, 

e= 1 


where  7r+i  =  n,  7T_i  =  1  —  7?,  7?  is  a  class-balance  estimate,  and  S  (>  0)  is  the  regularization 
parameter.  The  Gaussian  width  k  and  the  regularization  parameter  5  are  chosen  by  5-fold 
weighted  cross-validation  [160]  in  terms  of  the  misclassikcation  error. 

The  right  graphs  in  Figure  21  plot  the  test  misclassihcation  error  over  1000  runs.  The 
results  show  the  LSDD-based  method  provides  lower  classification  errors,  which  would  be 
brought  by  good  estimates  of  test  class-balances. 


4.4  Conclusion 

In  this  article,  we  reviewed  semi-supervised  adaptive  learning  techniques  for  the  covariate 
shift  and  class-balance  change  scenarios.  In  both  cases,  importance  weighting  plays  an 
essential  role.  See  [133]  for  more  general  discussion  on  learning  under  different  training 
and  test  distributions. 

If  input-output  samples  are  available  from  both  training  and  test  domains,  weighted 
learning  according  to  the  joint  importance  p'(x,y)/p(x,y)  can  in  principle  be  used  for 
transferring  training  samples  {(aq,  7/*)}f=1  to  the  test  domain  even  when  p(x,y)  and 
p'(x,y )  do  not  have  an  explicit  link  such  as  the  covariate  shift  and  class-balance  change 
[19,  168].  In  this  situation,  not  only  transferring  information  from  the  training  domain 
to  the  test  domain,  but  also  the  opposite  transfer  from  the  test  domain  to  the  training 
domain  is  possible  simultaneously.  This  is  the  idea  of  multi-task  learning  [30]  and  is  also 
an  important  branch  of  modern  machine  learning  research. 

Learning  from  input-output  samples  has  already  been  studied  extensively  in  statis¬ 
tics  and  machine  learning.  However,  collecting  input-output  samples  is  often  expensive 
and  time-consuming  in  practice.  Therefore,  learning  with  side  information  such  as  addi¬ 
tional  input-only  samples  (semi-supervised  learning)  and  additional  related  learning  tasks 
(transfer  learning  and  multi-task  learning),  as  well  as  new  models  of  input-output  data 
collection  such  as  crowdsourcing  [137]  and  self-taught  learning  [134],  will  be  important 
challenges  in  the  arriving  big  data  era. 


5  Information-Maximization  Clustering 

5.1  Introduction 

The  goal  of  clustering  is  to  classify  data  samples  into  disjoint  groups  in  an  unsupervised 
manner.  K-means  [117]  is  a  classic  but  still  popular  clustering  algorithm.  However, 
since  k-means  only  produces  linearly  separated  clusters,  its  usefulness  is  rather  limited  in 
practice. 
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To  cope  with  this  problem,  various  non-linear  clustering  methods  have  been  devel¬ 
oped.  Kernel  k-means  [58]  performs  k-means  in  a  feature  space  induced  by  a  reproducing 
kernel  function  [145].  Spectral  clustering  [149,  123]  first  unfolds  non-linear  data  manifolds 
by  a  spectral  embedding  method,  and  then  performs  k-means  in  the  embedded  space. 
Blurring  mean-shift  [53,  28]  uses  a  non-parametric  kernel  density  estimator  for  model¬ 
ing  the  data-generating  probability  density,  and  finds  clusters  based  on  the  modes  of  the 
estimated  density.  Discriminative  clustering  learns  a  discriminative  classifier  for  separat¬ 
ing  clusters,  where  class  labels  are  also  treated  as  parameters  to  be  optimized  [203,  14]. 
Dependence-maximization  clustering  determines  cluster  assignments  so  that  their  depen¬ 
dence  on  input  data  is  maximized  [153,  50].  See  Section  5.3  for  comprehensive  reviews  of 
existing  clustering  methods. 

These  non-linear  clustering  techniques  would  be  capable  of  handling  highly  complex 
real- world  data.  However,  they  suffer  from  lack  of  objective  model  selection  strategies12. 
More  specifically,  the  above  non-linear  clustering  methods  contain  tuning  parameters 
such  as  the  width  of  Gaussian  functions  and  the  number  of  nearest  neighbors  in  kernel 
functions  or  similarity  measures,  and  these  tuning  parameter  values  need  to  be  manually 
determined  in  an  unsupervised  manner.  The  problem  of  learning  similarities/kernels  was 
addressed  in  earlier  works  [118,  148,  37,  15],  but  they  considered  supervised  setups,  i.e., 
labeled  samples  are  assumed  to  be  given.  [216]  provided  a  useful  unsupervised  heuristic  to 
determine  the  similarity  in  a  data-dependent  way.  However,  it  still  requires  the  number  of 
nearest  neighbors  to  be  determined  manually  (although  the  magic  number  “7”  was  shown 
to  work  well  in  their  experiments). 

Another  line  of  clustering  framework  called  information-maximization  clustering  ex¬ 
hibited  the  state-of-the-art  performance  [1,  60].  In  this  information- maximization  ap¬ 
proach,  probabilistic  classifiers  such  as  a  kernclized  Gaussian  classifier  [1]  and  a  kernel 
logistic  regression  classifier  [60]  are  learned  so  that  mutual  information  (MI)  between  fea¬ 
ture  vectors  and  cluster  assignments  is  maximized  in  an  unsupervised  manner.  A  notable 
advantage  of  this  approach  is  that  classifier  training  is  formulated  as  continuous  opti¬ 
mization  problems,  which  are  substantially  simpler  than  discrete  optimization  of  cluster 
assignments.  Indeed,  classifier  training  can  be  carried  out  in  computationally  efficient 
manners  by  a  gradient  method  [1]  or  a  quasi-Newton  method  [60].  Furthermore,  [1]  pro¬ 
vided  a  model  selection  strategy  based  on  the  information-maximization  principle.  Thus, 
kernel  parameters  can  be  systematically  optimized  in  an  unsupervised  way. 

However,  in  the  above  Mi-based  clustering  approach,  the  optimization  problems  are 
non-convex,  and  finding  a  good  local  optimal  solution  is  not  straightforward  in  practice. 
The  goal  of  this  paper  is  to  overcome  this  problem  by  providing  a  novel  information- 
maximization  clustering  method.  More  specifically,  we  propose  to  employ  a  variant  of 
MI  called  squared-loss  MI  (SMI),  and  develop  a  new  clustering  algorithm  whose  solution 
can  be  computed  analytically  in  a  computationally  efficient  way  via  kernel  eigenvalue  de¬ 
composition.  Furthermore,  for  kernel  parameter  optimization,  we  propose  to  use  a  non- 
parametric  SMI  estimator  called  least-squares  MI  (LSMI)  [179,  157],  which  was  proved 

12  “Model  selection”  in  this  paper  refers  to  the  choice  of  tuning  parameters  in  kernel  functions  or 
similarity  measures,  not  the  choice  of  the  number  of  clusters. 
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to  achieve  the  optimal  convergence  rate  with  an  analytic-form  solution.  Through  exper¬ 
iments  on  various  real-world  datasets  such  as  images,  natural  languages,  accelerometric 
sensors,  and  speeches,  we  demonstrate  the  usefulness  of  the  proposed  clustering  method. 

The  rest  of  this  paper  is  structured  as  follows.  In  Section  5.2,  we  describe  our  proposed 
information-maximization  clustering  method  based  on  SMI  and  analyze  its  properties. 
Then  the  proposed  method  is  compared  with  existing  clustering  methods  qualitatively 
in  Section  5.3  and  quantitatively  in  Section  5.4.  Finally,  this  paper  is  concluded  in  Sec¬ 
tion  5.5. 

5.2  Information-Maximization  Clustering  with  Squared-Loss 
Mutual  Information 

In  this  section,  we  describe  our  proposed  clustering  algorithm. 

5.2.1  Formulation  of  Information-Maximization  Clustering 

Suppose  that  we  are  given  d- dimensional  i.i.d.  feature  vectors  of  size  n, 

{Xi  |  Xi  G  Md]T=1, 

which  are  drawn  independently  from  a  probability  distribution  with  density  p(x).  The 
goal  of  clustering  is  to  give  cluster  assignments, 

{Vi  I  Vi  e  {i,...,c}}"=1, 

to  the  feature  vectors  {ccj}"=1,  where  c  denotes  the  number  of  classes.  Throughout  this 
paper,  we  assume  that  c  is  known. 

In  order  to  solve  the  clustering  problem,  we  take  the  information-maximization  ap¬ 
proach  [1,  60].  That  is,  we  regard  clustering  as  an  unsupervised  classification  problem, 
and  learn  the  class-posterior  probability  p(y\x)  so  that  “information”  between  feature 
vector  x  and  class  label  y  is  maximized. 

The  dependence-maximization  approach  [153,  50]  (see  also  Section  5.3.7)  is  related  to, 
but  substantially  different  from  the  above  information-maximization  approach.  In  the 
dependence-maximization  approach,  cluster  assignments  {y,;}”=1  are  directly  determined 
so  that  their  dependence  on  feature  vectors  {cCj}”=l  is  maximized.  Thus,  the  dependence- 
maximization  approach  intrinsically  involves  combinatorial  optimization  with  respect  to 
{yi}l Li-  On  the  other  hand,  the  information-maximization  approach  involves  contin¬ 
uous  optimization  with  respect  to  the  parameter  a  included  in  a  class-posterior  model 
pe(y\x]  ck).  This  continuous  optimization  of  ol  is  substantially  easier  to  solve  than  discrete 
optimization  of  {y,:}f=i  • 

Another  advantage  of  the  information-maximization  approach  is  that  it  naturally  al¬ 
lows  out-of-sample  clustering  based  on  the  discriminative  model  pg(y |jc;  a),  i.e.,  a  cluster 
assignment  for  a  new  feature  vector  can  be  obtained  based  on  the  learned  discriminative 
model. 
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5.2.2  Squared-Loss  Mutual  Information 

As  an  information  measure,  we  adopt  squared-loss  mutual  information  (SMI).  SMI  be¬ 
tween  feature  vector  x  and  class  label  y  is  defined  by 

where  p(x,y)  denotes  the  joint  density  of  x  and  y ,  and  p(y)  is  the  marginal  probability 
of  y.  SMI  is  the  Pearson  divergence  [128]  from  p(x,y)  to  p(x)p(y),  while  the  ordinary  MI 
[38], 

d*.  (29) 

is  the  Kullback-Leibler  divergence  [103]  from  p(x,y )  to  p(x)p(y).  The  Pearson  divergence 
and  the  Kullback-Leibler  divergence  both  belong  to  the  class  of  Ali-Silvey-Csiszar  diver¬ 
gences  (which  is  also  known  as  f- divergences ,  see  [5,  39]),  and  thus  they  share  similar 
properties.  For  example,  SMI  is  non-negative  and  takes  zero  if  and  only  if  x  and  y  are 
statistically  independent,  as  the  ordinary  MI. 

In  the  existing  information- maximization  clustering  methods  [1,  60]  (see  also  Sec¬ 
tion  5.3.8),  MI  is  used  as  the  information  measure.  On  the  other  hand,  in  this  paper,  we 
adopt  SMI  because  it  allows  us  to  develop  a  clustering  algorithm  whose  solution  can  be 
computed  analytically  in  a  computationally  efficient  way  via  kernel  eigenvalue  decompo¬ 
sition. 


5.2.3  Clustering  by  SMI  Maximization 


Here,  we  give  a  computationally-efficient  clustering  algorithm  based  on  SMI  (28). 
Expanding  the  squared  term  in  Eq.(28),  we  can  express  SMI  as 


SMI 


1 

2 


1 

2 


^p{x)p(y) 

y= i 


f  p{x,y) 
\p(x)p(y) 


pjxpj) 

p{x)p(y) 


dx  + 


1 

2 


5 ~^p{y\x)p{x ) 
y= i 


_  i. 

PM  2 


(30) 


Suppose  that  the  class-prior  probability  p(y)  is  set  to  a  user-specified  value  ny  for  y  = 
1, . . . ,  c,  where  ny  >  0  and  Yll=i 71  y  =  1-  Without  loss  of  generality,  we  assume  that 
{7 Ty}y=i  are  sorted  in  the  ascending  order: 


VTl  <  •  •  •  <  7 Tc- 


If  {ny}y=i  Is  unknown,  we  may  merely  adopt  the  uniform  class-prior  distribution: 

p{y)  =  -  for  y  =  i,- ..,c, 

c 


(31) 
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which  will  be  non- informative  and  thus  allow  us  to  avoid  biasing  clustering  solutions13. 
Substituting  iry  into  p(y),  we  can  express  Eq.(30)  as 


1 

2 


—p(y\x)p{x)p(y\x)dx  - 


(32) 


Let  us  approximate  the  class-posterior  probability  p(y\x)  by  the  following  kernel 
model: 


n 

Pe(y\x;  a)  :  =  ^  ayjiK(x,  a;,),  (33) 

i—1 


where  a  =  (ce^i, . . . ,  ac,n)T  is  the  parameter  vector,  T  denotes  the  transpose,  and  K(x,  x') 
denotes  a  kernel  function  with  a  kernel  parameter  t.  In  the  experiments,  we  will  use  a 
sparse  variant  of  the  local-scaling  kernel  [216]: 


Is.  (ay ,  x  j ) 


I  Xi  Xj  I 

2  V/  (7 j 


if  Xi  £  Mt{xj)  or  Xj  £  Jsft(xi), 
otherwise, 


(34) 


where  J\ft(x)  denotes  the  set  of  t  nearest  neighbors  for  x  (t  is  the  kernel  parameter),  a,  is 
a  local  scaling  factor  defined  as  at  =  \\xi  —  ccf'l||,  and  x'p  is  the  t-tli  nearest  neighbor  of 

OCi. 

Further  approximating  the  expectation  with  respect  to  p(x)  included  in  Eq.(32)  by 
the  empirical  average  of  samples  {ay}”=1,  we  arrive  at  the  following  SMI  approximator: 


SMI 


1 

2  n 


C 


£ 


—  O^K-CXy 

/I  7 / 


1 

2’ 


(35) 


where  oty  :=  (aV)i, . . . ,  ay,n)T  and  Ktj  :=  K(xi,  Xj). 

For  each  cluster  y,  we  maximize  op K2cty  under  llrajl  =  1.  Since  this  is  the  Rayleigh 
quotient,  the  maximizer  is  given  by  the  normalized  principal  eigenvector  of  K  [73].  To 
avoid  all  the  solutions  {ot.y}cy=l  to  be  reduced  to  the  same  principal  eigenvector,  we  impose 
their  mutual  orthogonality:  opayi  =  0  for  y  ^  y'.  Then  the  solutions  are  given  by  the 
normalized  eigenvectors  (pi, <fic  associated  with  the  eigenvalues  Ai  >  •  •  •  >  An  >  0  of 
K.  Since  the  sign  of  (py  is  arbitrary,  we  set  the  sign  as 

4>y  =  (pyX  Sigll(0,[ln), 

where  sign(-)  denotes  the  sign  of  a  scalar  and  ln  denotes  the  n-dimensional  vector  with 
all  ones. 
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Such  a  cluster-balance  constraint  is  often  employed  in  existing  clustering  algorithms  [149,  203,  126]. 
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On  the  other  hand,  since 


f  1  X  A  1 

p(y)  =  /  p(y\x)p(x)dx  ~  -  S^p0(y\xi]cx)  =  -alK ln, 
J  n  n  y 

i=i 


and  the  class-prior  probability  p(y)  was  set  to  7Ty  for  y  =  1, . . . ,  c,  we  have  the  following 
normalization  condition: 


1 

n 


ctTyK  1 


=  71 


y 


Furthermore,  probability  estimates  should  be  non-negative,  which  can  be  achieved  by 
rounding  up  negative  outputs  to  zero. 

Taking  these  normalization  and  non-negativity  issues  into  account,  cluster  assignment 
y,  for  X{  is  determined  as  the  maximizer  of  the  approximation  of  p(y\x.i)\ 


[max(On,  K<py)\i 

yi  =  argmax - - 

y  (mry)_1  max(On,  K(f)y)Tln 


argmax 

y 


7ry[max(0n,  (j>y)\i 
max(On,  4>y)Tln 


where  0ra  denotes  the  n-dimensional  vector  with  all  zeros,  the  max  operation  for  vectors 
is  applied  in  the  element-wise  manner,  and  [•]*  denotes  the  i-th  element  of  a  vector.  Note 
that  we  used  K<py  =  X y(f>y  in  the  above  derivation.  For  out-of-sarnplc  prediction,  cluster 
assignment  y'  for  new  sample  x'  may  be  obtained  as 


ny  max  ( 0,  Y%=i  K (*',  *») [4>y]i) 

y'  :=  argmax  - - - ^ - — 

y  Xy  max(On,  <f)y)Tln 

We  call  the  above  method  SMI-based  clustering  (SMIC). 


Discussions:  Given  an  SMI  approximator  SMI  defined  by  Eq.(35),  a  natural  optimiza¬ 
tion  criterion  would  be  to  impose  non-negativity  and  normalization  constraints  on  the 
parameter  at.  However,  this  results  in  a  non-co nvex  optimization  problem  and  it  is  not 
straightforward  to  obtain  the  global  optimal  solution  or  even  a  good  local  solution  without 
any  prior  knowledge.  For  this  reason,  we  decided  to  introduce  the  unit-norm  constraint 
||o:y||  =  1  on  the  parameter,  which  allows  us  to  obtain  the  global  optimal  solution  analyti¬ 
cally  even  though  the  optimization  problem  is  still  non-convex.  Although  the  introduction 
of  the  unit-norm  constraint  is  a  heuristic,  this  formulation  has  an  advantage  that  we  do 
not  have  to  specify  a  good  initial  solution  and  it  will  be  shown  to  work  well  in  experiments 
in  Section  5.4. 


5.2.4  Kernel  Parameter  Choice  by  SMI  Maximization 

The  solution  of  SMIC  depends  on  the  choice  of  the  kernel  parameter  t  included  in  the 
kernel  function  K(x,x').  Since  SMIC  was  developed  in  the  framework  of  SMI  maxi¬ 
mization,  it  would  be  natural  to  determine  the  kernel  parameter  t  so  as  to  maximize 
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SMI.  A  direct  approach  is  to  use  the  SMI  estimator  SMI  given  by  Eq.(35)  also  for  kernel 
parameter  choice.  However,  this  direct  approach  is  not  favorable  because  SMI  is  an  unsu¬ 
pervised  SMI  estimator  (i.e.,  SMI  is  estimated  only  from  unlabeled  samples  {cc*}”=1).  On 
the  other  hand,  in  the  model  selection  stage,  we  have  already  obtained  labeled  samples 
{(a;*,  ?/i)}”=1,  and  thus  supervised  estimation  of  SMI  is  possible.  For  supervised  SMI  esti¬ 
mation,  a  non-parametric  SMI  estimator  called  least-squares  mutual  information  (LSMI) 
[179]  was  shown  to  achieve  the  optimal  convergence  rate.  For  this  reason,  we  propose  to 
use  LSMI  for  model  selection,  instead  of  SMI  (35). 

LSMI  is  an  estimator  of  SMI  based  on  paired  samples  {(cc*,  ?/j)}"=1.  The  key  idea  of 
LSMI  is  to  learn  the  following  density-ratio  function  [168], 


r{x,y) 


p(x>  y) 

p(x)p(y)' 


(36) 


without  going  through  density  estimation  of  p(x,y ),  p(x),  and  p(y).  More  specifically,  let 
us  employ  the  following  density-ratio  model: 


re{x,  y,  6)  ^2  0iL(x,xt), 

kyt=y 


(37) 


where  6  =  (#!,...,  9n)T  and  L(x,  x')  is  a  kernel  function  with  a  kernel  parameter  7.  In 
the  experiments,  we  will  use  the  Gaussian  kernel: 

L{x,  x')  =  exp  ~  j  ,  (38) 

where  the  Gaussian  width  7  is  the  kernel  parameter. 

The  parameter  9  in  the  above  density-ratio  model  is  learned  so  that  the  following 
squared  error  is  minimized: 


1 

min  - 
e  2 


2 

p(x)p(y)  dx. 


(39) 


Let  6y  be  the  parameter  vector  corresponding  to  the  kernel  bases  {L(x,  xf)}tyt=y,  i.e., 
By  is  the  sub-vector  of  6  =  (9 1, . . . ,  9n)T  consisting  of  indices  {i  \  ye  —  y}.  Let  ny  be  the 
length  of  Gy ,  i.e.,  the  number  of  samples  in  cluster  y.  Then  an  empirical  and  regularized 
version  of  the  optimization  problem  (39)  is  given  for  each  y  as  follows: 


min 

Oy 


(40) 


where  5  (>  0)  is  the  regularization  parameter.  is  the  ny  x  ny  matrix  and  is  the 

ny- dimensional  vector  defined  as 


^2  x?)L{Xi->  X?)i 

1=1 

i-yi=y 
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where  is  the  £- th  sample  in  class  y  (which  corresponds  to  9 

A  notable  advantage  of  LSMI  is  that  the  solution  9^  can  be  computed  analytically 
as 

Q(y)  =  (H^  +  5 1)-1  h^. 

Then  a  density-ratio  estimator  is  obtained  analytically  as  follows14: 

ny 

r(x,y)  =  6tf'>  L(x,  xf^). 

i= i 

The  accuracy  of  the  above  least-squares  density-ratio  estimator  depends  on  the  choice 
of  the  kernel  parameter  7  included  in  L(x,x')  and  the  regularization  parameter  5  in 
Eq.(40).  [179]  showed  that  these  tuning  parameter  values  can  be  systematically  optimized 
based  on  cross-validation  as  follows:  First,  the  samples  Z  =  {(37, 2/i)}”=1  are  divided  into 
M  disjoint  subsets  {Zm}^=1  of  approximately  the  same  size  (we  use  M  —  5  in  the 
experiments).  Then  a  density-ratio  estimator  r^x^y)  is  obtained  using  Z\Zm  (i.e.,  all 
samples  without  Zm),  and  its  out-of-sample  error  (which  corresponds  to  Eq.(39)  without 
irrelevant  constant)  for  the  hold-out  samples  Zrn  is  computed  as 

CVTO  :=  *  Y  rm(x,y)2--^—  Y  rm{x,y), 

1  m|  x,yezm  1  (x,y)ezm 

where  ]T)x  y&z  denotes  the  summation  over  all  combinations  of  x  and  y  in  Zm  (and  thus 
\Zm\2  terms),  while  Y^(xy)ezm  denotes  the  summation  over  all  pairs  ( x,y )  in  Zm  (and 
thus  Zrn |  terms).  This  procedure  is  repeated  for  rri  —  1, . . . ,  M,  and  the  average  of  the 
above  hold-out  error  over  all  m  is  computed  as 

1  M 

CV  :=  —  V  CVm. 

M  ^ 

m= 1 

Then  the  kernel  parameter  7  and  the  regularization  parameter  5  that  minimize  the  average 
hold-out  error,  CV,  are  chosen  as  the  most  suitable  ones. 

Finally,  based  on  an  expression  of  SMI  (28), 


14Note  that,  in  the  original  LSMI  paper  [179],  the  entire  parameter  G  =  (0\, . . .  ,0n)T  for  all  classes 
was  optimized  at  once.  On  the  other  hand,  we  found  that,  when  the  density-ratio  model  rg(x,y;0) 
defined  by  Eq.(37)  is  used  for  SMI  approximation,  exactly  the  same  solution  as  the  original  LSMI  paper 
can  be  computed  more  efficiently  by  class-wise  optimization.  Indeed,  in  our  preliminary  experiments,  we 
confirmed  that  our  class- wise  optimization  significantly  reduces  the  computation  time  compared  with  the 
original  all-class  optimization,  with  the  same  solution.  Note  that  the  original  LSMI  is  applicable  to  more 
general  setups  such  as  regression,  multi-label  classification,  and  structured-output  prediction.  Thus,  our 
speedup  was  brought  by  focusing  on  classification  scenarios  where  Kronecker’s  delta  function  is  used  as 
the  kernel  for  class  labels  in  the  density-ratio  model  (37). 
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j;  *  1 


{K}?=i 


Figure  22:  Schematic  of  the  proposed  clustering  algorithm.  We  prepare  T  kernel  candi¬ 
dates  {Kt(x,  x')}J=1,  compute  cluster  assignments  {y^}™± ^t=1  by  SMIC,  and  choose  the 
best  one  that  maximizes  LSMI. 

Input:  Feature  vectors  X  =  { xt }  "= ,  and  the  number  c  of  clusters 
Output:  Cluster  assignments  y  =  {;//,}■ '=1 

For  each  kernel  parameter  candidate  t  e  T 
y(t:>  i —  SMIC(A,  £,  c); 

LSMI(t)  4 —  LSMI(^,yW); 

end 

t  < —  argmax  LSMI(f); 

teT 

y  <—  y(?); 

Figure  23:  Pseudo  code  of  information-maximization  clustering  based  on  SMIC  and  LSMI. 
The  kernel  parameter  t  refers  to  the  tuning  parameter  included  in  the  kernel  function 
K( x,x')  in  the  cluster-posterior  model  (33).  Pseudo  codes  of  SMIC  and  LSMI  are  de¬ 
scribed  in  Figure  24  and  Figure  25,  respectively. 


the  SMI  estimator  called  LSMI  is  given  as  follows: 

n  n 

LSMI  :=  +  “  2’  (41) 
i,j= 1  i= 1 

where  r^x^y)  is  a  density-ratio  estimator  obtained  above.  Since  r(x,y)  can  be  computed 
analytically,  LSMI  can  also  be  computed  analytically. 

We  use  LSMI  for  model  selection  of  SMIC.  More  specifically,  we  compute  LSMI  as  a 
function  of  the  kernel  parameter  t  of  K(x,  x')  included  in  the  cluster-posterior  model  (33), 
and  choose  the  one  that  maximizes  LSMI.  See  Figure  22  for  a  schematic.  A  pseudo  code 
of  the  entire  SMI-maximization  clustering  procedure  is  summarized  in  Figures  23-25. 

Discussions:  SMI  given  by  Eq.(35)  is  used  for  determining  cluster  assignments  {y?;}”=1, 
while  LSMI  is  used  for  model  selection.  Since  LSMI  was  shown  to  be  the  optimal  ap¬ 
proximator  of  SMI,  it  would  be  more  natural  to  use  LSMI  also  for  determining  cluster 
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Input:  Feature  vectors  X  =  {cc*}f=1,  kernel  parameter  t, 
and  the  number  c  of  clusters 
Output:  Cluster  assignments  y  =  {y»}"=1 


K  i —  Kernel  matrix  for  samples  X  and  kernel  parameter  t: 
<py  i —  y-th  principal  eigenvectors  of  K  for  y  —  1, . . . ,  c; 

(j)y  < —  4>y  x  sign ln)  for  y  =  1, , . . ,  c; 


max(On,  0y)]i 


y,  < —  argmax 

ye{i,...,c}  max(On,  <j>y] 

y  «—  {y*}f=1; 


T  i 


for  i  =  1, . 


Figure  24:  Psendo  code  of  SMIC  (with  the  uniform  class-prior  distribution).  The  kernel 
parameter  t  refers  to  the  tuning  parameter  included  in  the  kernel  function  K(x,  x')  in  the 
cluster-posterior  model  (33).  If  the  class-prior  probability  p(y)  is  set  to  a  user-specified 
value  Tty  for  y  =  1, . . . ,  c,  y,:  is  determined  as  argmax  7rdmax(^'<M»  _ 

*  max(0 1  n 


Input:  Feature  vectors  X  =  {cCj}"=1  and  cluster  assignments  y  =  {y,}”=1 
Output:  LSMI  (an  SMI  estimate) 

z  < —  {(®i,y*)}jLi; 

{Zm}m=i  t —  M  disjoint  subsets  of  Z ; 

For  each  kernel  parameter  candidate  7  G  F 

For  each  regularization  parameter  candidate  5  G  A 
For  each  fold  m  —  1 , ,M 

rltgtm(x,y)  < —  Density  ratio  estimator  for  (7,  5)  using  Z\Zm\ 

CVm( ,y,5)  < —  Hold-out  error  of  r7^m(x,y)  for  Zm\ 

end 

1  M 

CV(7,(5)  4 —  —  ^2  CVm(7,<5); 

ra=l 

end 

end 

(7,5)  t —  argmin  C\’(~ .  d): 

7er,<5eA 

r(x,y )  t —  Density  ratio  estimator  for  (7,  5)  using  Z; 

1  n  1  n  1 

LSMI  4 - ^  ^  -  2’ 

Figure  25:  Pseudo  code  of  LSMI.  The  kernel  parameter  7  refers  to  the  tuning  parameter 
included  in  the  kernel  function  L(x,x')  in  the  density-ratio  model  (37). 
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assignments  in  a  dependence-maximizing  way  [153,  50].  However,  this  is  not  practical 
because  maximizing  LSMI  with  respect  to  cluster  assignments  is  a  hard  optimiza¬ 

tion  problem  and  a  naive  greedy-search  strategy  may  not  give  a  good  solution  without 
any  prior  knowledge.  For  this  reason,  we  decided  to  use  different  criteria,  SMI  and  LSMI, 
for  determining  cluster  assignments  and  model  selection.  In  principle,  it  is  possible  to  use 
an  arbitrary  clustering  algorithm  in  the  first  step  and  then  evaluate  its  validity  by  LSMI 
in  the  second  stage,  although  SMI  and  LSMI  are  “consistent”  in  the  sense  that  they  are 
both  approximators  of  SMI. 

5.2.5  Perturbation  Stability  Analysis 

Here,  we  analyze  the  perturbation  stability  of  the  proposed  clustering  algorithm. 

Let  us  denote  the  set  of  symmetric  matrices  of  size  n  by  §n  C  Mnxn,  and  the  Frobenius 
norm  of  a  matrix  by  ||  •  1 1 Frob •  For  A  e  Sn,  we  denote  by  A(A)  the  spectra  of  A,  i.e.,  the 
set  of  all  eigenvalues  of  A.  For  e  >  0,  a  subset  A(A)  of  A(A)  is  said  to  be  an  e-cluster  of 
(the  spectra  of)  A,  if  the  following  two  conditions  are  met: 

1.  A(A)  has  a  diameter  smaller  than  e. 

2.  dy(A(A),  A(A)  \  A(A))  >  e,  where  dy  is  the  Hausdorff  distance. 

First,  we  review  a  fundamental  perturbation  result  given  in  the  appendix  of  [99], 
Lemma  5.2  of  [100],  and  pp. 33-34  in  [197]. 

Proposition  1  (Finite-dimensional  perturbation).  For  A  e  §n,  let  pi  >  ■■■  >  p^  be 
the  eigenvalues  of  A  counted  without  multiplicity,  and  W\ , . . . ,  H4  be  the  corresponding 
eigenspaces.  Let  Pj(A )  be  the  orthogonal  projection  onto  Wj  for  j  =  1 , ,k.  For  1  < 
r  <  k,  define  the  eigengap 

5r  :=  min  {p~  —  m+1}. 

3=1,-, r 

Fix  r,  let  0  <  e  <  5r/ 4,  and  assume  perturbation  with  1 1  -B 1 1 FVcb  <  e.  Then, 

1.  The  spectra  A  (A  +  B)  of  ( A  +  B)  can  be  partitioned  into  r  +  1  subsets,  i.e.,  r 
e-clusters  A j(A  +  B)  for  j  =  1, ... ,r  and  the  residue  Rr  satisfy 

Aj(A  +  B)cB(W,e),  (42) 

where  denotes  the  open  ball  with  center  pj  and  radius  e,  and 

d'H.i.Pr,  {hi j  •  •  •  j  hr} )  ^  dr  6. 

2.  Denote  by  Pj(A+B )  the  orthogonal  projection  onto  the  direct  sum  of  the  eigenspaces 
of  (A  +  B)  with  eigenvalues  in  the  cluster  A j(A  +  B)  for  j  =  1, . . . ,  k.  For  all 
j  =  1, ...  ,  r,  we  have 

tr  (Pj(A  +  B))  =  tr(  P,(A))  (43) 

and 

||Pi(A  +  B)-Pi(A)||  Prob  A  4 1 1  T$  1 1  Frob  /^r  •  (44) 
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Intuitively  speaking,  Eq.(42)  says  that  the  perturbed  eigenvalues  are  close  to  the 
original  eigenvalues,  Eq.(44)  says  that  the  perturbed  eigenspaces  are  close  to  the  original 
eigenspaces,  and  Eq.  (43)  guarantees  the  same  dimensionality  of  the  eigenspaces  and  thus 
the  same  multiplicity  of  perturbed  and  original  eigenvalues,  provided  that  the  eigenvalues 
of  A  are  well-separated,  i.e.,  the  eigengap  5r  is  more  than  4 1 1  1 1  Frob - 

Now  we  apply  the  above  result  to  SMIC.  Recall  that  SMIC  maximizes  the  objective 
function  defined  in  Eq.(35), 


1 

2  n 


C 


£ 


-alK2ay 

/l  7 1 


1 

2’ 


subject  to  the  orthonormality  of  {aq, . . . ,  ay}.  We  can  bound  the  difference  between 
empirical  and  optimal  solutions  under  a  kernel  matrix  perturbation  A  e  §n  with 
|  A  ||  prob  |  K 1 1  Frob  aS  follows: 


Theorem  2  (Kernel  matrix  perturbation).  Suppose  that  the  kernel  function  satisfies 
K(x,x)  =  1  for  all  x.  Let  fi\  >  ■  ■  ■  >  pr  be  the  first  r  eigenvalues  of  the  kernel  ma¬ 
trix  K  counted  without  multiplicity,  such  that  pr  is  the  c-th  largest  eigenvalue  of  K  if 
counted  with  multiplicity.  Define  the  eigengap 


5r  =  min  {/ij  —  Pj+i}. 

j=l....,r 

Assume  that  the  kernel  matrix  K  is  perturbed  as 

K'  =  K  +  A, 


where  A  e  §n  with  1 1 A 1 1 Frob  <  drj\.  Denote  by  v  and  {(f) i, . . . ,  <f>c}  the  optimal  value  and 
solutions  of  SMIC  for  K ,  and  by  v'  the  optimal  value  of  SMIC  for  K' .  Then  we  have 

\v  -  v'\  <  ||  A||Frob/7Ti,  (45) 

and  there  exist  optimal  solutions  {(/)[, . . . ,  (/){}  for  K'  such  that 

II <t>y  -  <t>y\\ 2  <  4 1| A || Frob/ Sr  for  y  =  1,. . . ,  c,  (46) 

where  ||  •  H2  denotes  the  dz-norm. 

A  proof  of  Theorem  2  is  provided  in  Appendix  5.6.  This  theorem  shows  that  the 
difference  in  SMIC  solutions  is  bounded  by  the  amount  of  perturbation  in  the  kernel 
matrix,  which  is  a  desirable  property  in  practice.  Note  that,  by  “there  exist  optimal 
solutions  {<()[,...,  4>'c}” ,  we  mean  that  {(/)[,...,(/){}  need  to  be  chosen  carefully,  since 
SMIC  involves  non-convex  optimization  and  thus  there  may  exist  multiple  globally  optimal 
solutions.  However,  if  K  has  c  distinct  top  eigenvalues  which  would  be  a  usual  case  in 
practice,  it  will  be  easy  to  determine  (p’y  because  the  only  degree  of  freedom  is  its  sign. 

Next,  we  analyze  the  post-processing  step  of  SMIC. 
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Theorem  3  (Post-processing  perturbation).  Under  the  same  assumption  as  Theorem  2, 
suppose  that  { 4>\ , ....  </>(,}  satisfy  Eq.( 46).  Without  loss  of  generality,  we  further  assume 
that 

lZ<t>y  >  0  and  lZfiy  >  Ofory= 

Define  the  soft  response  vectors  based  on  the  solutions  {<f> i, . . . ,  </>c}  and  {<£>[, ....  </>(,}  as 

fv  =  ^v4>yl{^Z4>v)  and  fy  =  Ky<t>'y  /{iZ&y)  fory  = 

respectively,  where  <py  =  max(On,  (f>y)  and  <f>'y  =  max(On,  4>'y).  Then,  for  y  —  1, ...  ,c,  we 
have 

II  fy  -  f'yh/yfn  <  wV2-Ky\\A.\\Fmh/5r. 

A  proof  of  Theorem  3  is  provided  in  Appendix  5.7.  This  theorem  shows  that  SMIC  is 
stable  with  respect  to  kernel  matrix  perturbation  A.  That  is,  the  root-mean-square  error 
||  fy  —  f'y  1 1 2 / \JTi  will  vanish  as  n  — >  oo ,  if  the  intensity  of  the  perturbation  measured  by 
1 1 A 1 1 Frob / dr  is  asymptotically  an  inhnitesimal,  i.e.,  ||A||prob/^r  £  o(l)  in  terms  of  n. 

5.3  Existing  Clustering  Methods 

In  this  section,  we  review  existing  clustering  methods  and  qualitatively  discuss  the  relation 
to  the  proposed  approach. 

5.3.1  K-Means  Clustering 

K-means  clustering  [117]  would  be  one  of  the  most  popular  clustering  algorithms.  It 
tries  to  minimize  the  following  distortion  measure  with  respect  to  the  cluster  assignments 


Yi  Y  iix<  -  M2-  (47) 

y=  i  i:Vi=y 

where  pLy  :=  Yhi-yi=y  xi  is  centroid  of  cluster  y  and  ny  is  the  number  of  samples  in 
cluster  y. 

The  original  k-means  algorithm  is  capable  of  only  producing  linearly  separated  clusters 
[46].  However,  since  samples  are  used  only  in  terms  of  their  inner  products,  its  non-linear 
variant  can  be  immediately  obtained  by  performing  k-means  in  a  feature  space  induced 
by  a  reproducing  kernel  function  [58]. 

As  the  optimization  problem  of  (kernel)  k-means  is  NP-hard  [6],  a  greedy  optimization 
algorithm  is  usually  used  for  finding  a  local  optimal  solution  in  practice.  It  was  shown 
that  the  solution  to  a  continuously-relaxed  variant  of  the  kernel  k-means  problem  is  given 
by  the  principal  components  of  the  kernel  matrix  [217,  44],  Thus,  post-discretization  of 
the  relaxed  solution  may  give  a  good  approximation  to  the  original  problem,  which  is 
computationally  efficient.  This  idea  is  similar  to  the  proposed  SMIC  method  described 
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in  Section  5.2.3.  However,  an  essential  difference  is  that  SMIC  handles  the  continuous 
solution  directly  as  a  parameter  estimate  of  the  class-posterior  model. 

The  performance  of  kernel  k-means  depends  heavily  on  the  choice  of  kernel  functions, 
and  there  is  no  systematic  way  to  determine  the  kernel  function.  This  is  a  critical  weakness 
of  kernel  k-means  in  practice.  On  the  other  hand,  our  proposed  approach  offers  a  natural 
model  selection  strategy,  which  is  a  significant  advantage  over  kernel  k-means. 


5.3.2  Spectral  Clustering 

The  basic  idea  of  spectral  clustering  [149,  123]  is  to  first  unfold  non-linear  data  manifolds 
by  a  spectral  embedding  method,  and  then  perform  k-means  in  the  embedded  space.  More 
specifically,  given  sample-sample  similarity  >  0  (large  Wtj  means  that  Xi  and  Xj  are 
similar),  embedded  samples  are  obtained  as  the  minimizer  of  the  following  criterion  with 
respect  to  {£i}[l=i  under  some  normalization  constraint: 


n 


£  w‘j 


where  D  is  the  diagonal  matrix  with  i-th  diagonal  element  given  by  Dhl  :=  i  Hhj  • 
Consequently,  the  embedded  samples  are  given  by  the  principal  eigenvectors  of 
D~2\VD~2,  followed  by  normalization.  Note  that  spectral  clustering  was  shown  to 
be  equivalent  to  a  weighted  variant  of  kernel  k-means  with  some  specific  kernel  [43]. 

The  performance  of  spectral  clustering  depends  heavily  on  the  choice  of  sample-sample 
similarity  Wij.  [216]  proposed  a  useful  unsupervised  heuristic  to  determine  the  similarity 
in  a  data-dependent  manner,  called  local  scaling: 


Wij  =  exp 


where  cq  is  a  local  scaling  factor  defined  as 


&i  = 


Xi 


and  x:p  is  the  t-th  nearest  neighbor  of  X{.  t  is  the  tuning  parameter  in  the  local  scaling 
similarity,  and  t  =  7  was  shown  to  be  useful  [216,  156].  However,  this  magic  number  “7” 
does  not  seem  to  work  always  well  in  general. 

If  D~2\VD  '2  is  regarded  as  a  kernel  matrix,  spectral  clustering  will  be  similar  to 
the  proposed  SMIC  method  described  in  Section  5.2.3.  However,  SMIC  does  not  require 
the  post  k-means  processing  since  the  principal  components  have  clear  interpretation 
as  parameter  estimates  of  the  class-posterior  model  (33).  Furthermore,  our  proposed 
approach  provides  a  systematic  model  selection  strategy,  which  is  a  notable  advantage 
over  spectral  clustering. 
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5.3.3  Blurring  Mean-Shift  Clustering 

Blurring  mean-shift  [53]  is  a  non-parametric  clustering  method  based  on  the  modes  of  the 
data-generating  probability  density. 

In  the  blurring  mean-shift  algorithm,  a  kernel  density  estimator  [152]  is  used  for  mod¬ 
eling  the  data-generating  probability  density: 

1  n 

p(x)  =  -  K  (n* -  **ii2/^2)  , 

n  z ' 

i=l 

where  K(£)  is  a  kernel  function  such  as  a  Gaussian  kernel  K(£)  =  e~ ^2.  Taking  the 
derivative  of  p(x)  with  respect  to  x  and  equating  the  derivative  at  x  =  xt  to  zero,  we 
obtain  the  following  updating  formula  for  sample  Xi  {i  —  1, . . . ,  n): 

Y2j=i  WijXj 

Xi  EJ=i  Wij.  ’ 

where  WltJ  :=  K’  (||a;j  —  Xj\\2 /a2)  and  K'(f)  is  the  derivative  of  K(£).  Each  mode  of  the 
density  is  regarded  as  a  representative  of  a  cluster,  and  each  data  point  is  assigned  to  the 
cluster  which  it  converges  to. 

[29]  showed  that  the  blurring  mean-shift  algorithm  can  be  interpreted  as  an 
expectation-maximization  algorithm  [41],  where  Wij/(^”,=1  W^y)  is  regarded  as  the  pos¬ 
terior  probability  of  the  i-th  sample  belonging  to  the  j-th  cluster.  Furthermore,  the  above 
update  rule  can  be  expressed  in  a  matrix  form  as  X  < —  XP,  where  X  =  (crq, . . . ,  xn )  is 
a  sample  matrix  and  P  :=  WD  1  is  a  stochastic  matrix  of  the  random  walk  in  a  graph 
with  adjacency  W  [35].  D  is  defined  as  Di%  :=  Y^=i^i,j  and  =  0  for  i  ^  j.  If 
P  is  independent  of  X ,  the  above  iterative  algorithm  corresponds  to  the  power  method 
[59]  for  finding  the  leading  left  eigenvector  of  P.  Then,  this  algorithm  is  highly  related 
to  the  spectral  clustering  which  computes  the  principal  eigenvectors  of  D  (see 

Section  5.3.2).  Although  P  depends  on  X  in  reality,  [28]  insisted  that  this  analysis  is  still 
valid  since  P  and  X  quickly  reach  a  quasi-stable  state. 

An  attractive  property  of  blurring  mean-shift  is  that  the  number  of  clusters  is  automat¬ 
ically  determined  as  the  number  of  modes  in  the  probability  density  estimate.  However, 
this  choice  depends  on  the  kernel  parameter  o  and  there  is  no  systematic  way  to  determine 
a,  which  is  restrictive  compared  with  the  proposed  method.  Another  critical  drawback  of 
the  blurring  mean-shift  algorithm  is  that  it  eventually  converges  to  a  single  point  (i.e.,  a 
single  cluster,  see  [34]),  and  therefore  a  sensible  stopping  criterion  is  necessary  in  practice. 
Although  [28]  gave  a  useful  heuristic  for  stopping  the  iteration,  it  is  not  clear  whether 
this  heuristic  always  works  well  in  practice. 

5.3.4  Discriminative  Clustering 

The  support  vector  machine  (SVM)  [193]  is  a  supervised  discriminative  classifier  that  tries 
to  find  a  hyperplane  separating  positive  and  negative  samples  with  the  maximum  margin. 
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[203]  extended  SVM  to  unsupervised  classification  scenarios  (i.e. ,  clustering),  which  is 
called  maximum-margin  clustering  (MMC). 

MMC  inherits  the  idea  of  SVM  and  tries  to  find  the  cluster  assignments  y  = 
(yi, . . . ,  yn)T  so  that  the  margin  between  two  clusters  is  maximized  under  proper  con¬ 
straints: 


min  max  2ATln  —  (K  o  XX,yyT) 
ye{+i  ,-i}"  a 

subject  to  —  £  <  1  Jt  y  <  e  and  0n  <  A  <  C ln, 

where  o  denotes  the  Hadamard  product  (also  known  as  the  entry-wise  product),  and  £ 
and  C  are  tuning  parameters.  The  constraint  — £  <  l^y  <  £  corresponds  to  balancing 
the  cluster  size. 

Since  the  above  optimization  problem  is  combinatorial  with  respect  to  y  and  thus 
hard  to  solve  directly,  it  is  relaxed  to  a  semi-definite  program  by  replacing  yyT  (which 
is  a  zero-one  matrix  with  rank  one)  with  a  real  positive  semi-definite  matrix  [203] .  Since 
then,  several  approaches  have  been  developed  for  further  improving  the  computational 
efficiency  of  MMC  [191,  220,  219,  108,  199]. 

The  performance  of  MMC  depends  heavily  on  the  choice  of  the  tuning  parameters  £ 
and  C,  but  there  is  no  systematic  method  to  tune  these  parameters.  The  fact  that  our 
proposed  approach  is  equipped  with  a  model  selection  strategy  would  practically  be  a 
strong  advantage  over  MMC. 

Following  a  similar  line  to  MMC,  a  discriminative  and  flexible  framework  for  clustering 
(DIFFRAC)  [14]  was  proposed.  DIFFRAC  tries  to  solve  a  regularized  least-squares  prob¬ 
lem  with  respect  to  a  linear  predictor  and  class  labels.  Thanks  to  the  simple  least-squares 
formulation,  the  parameters  in  the  linear  predictor  can  be  optimized  analytically,  and 
thus  the  optimization  problem  is  much  simplified.  A  kernelized  version  of  the  DIFFRAC 
optimization  problem  is  given  by 

min  tr(nnTKr(rR:r  +  nnln)  1T). 

ye{+l,-l}» 

where  II  is  the  n  x  c  cluster  indicator  matrix,  which  takes  1  only  at  one  of  the  elements 
in  each  row  (this  corresponds  to  the  index  of  the  cluster  to  which  the  sample  belongs) 
and  others  are  all  zeros,  k  (>  0)  is  the  regularization  parameter,  and  T  :=  In  —  ^lnl][ 
is  a  centering  matrix.  In  practice,  the  above  optimization  problem  is  relaxed  to  a  semi- 
definite  program  by  replacing  IHIT  with  a  real  positive  semi-definite  matrix.  However, 
DIFFRAC  is  still  computationally  expensive  and  it  suffers  from  lack  of  objective  model 
selection  strategies. 

5.3.5  Generative  Clustering 

In  the  generative  clustering  framework  [46],  class  labels  are  determined  by 

y  =  argmax  p(y\x)  =  argmax  p(x,y), 
y  y 
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where  p(y\x)  is  the  class-posterior  probability  and  p(x,y)  is  the  data-generating  proba¬ 
bility.  Typically,  p(x,y)  is  modeled  as 

pe(x,  y:  / 3 ,  tt)  =  pe(x\y;  (3)p0{y\  tt), 

where  (3  and  tv  are  parameters.  Canonical  model  choice  is  the  Gaussian  distribution  for 
pg(x\y;  (3)  and  the  multinomial  distribution  for  p0(y ;  tv). 

However,  since  class  labels  {yiY(=\  are  unknown,  one  may  not  directly  learn  (3  and  tv 
in  the  joint-probability  model  pe(x,  y,  (3,  tv).  An  approach  to  coping  with  this  problem  is 
to  consider  a  marginal  model, 


Po(x;f3,  tv)  =  ^2pe(x\y;f3)pe(y\7T), 
y= i 

and  learns  the  parameters  (3  and  tv  by  maximum  likelihood  estimation  [46]: 

n 

max  TT po(xi]  (3,  tv). 

(3, 7T  -1"  -L- 

1=1 

Since  the  likelihood  function  of  the  above  mixture  model  is  non-convex,  a  gradient  method 
[7]  may  be  used  for  finding  a  local  maximizer  in  practice.  For  determining  the  number  of 
clusters  (mixtures)  and  the  mixing-element  model  po(x\y,(3),  likelihood  cross-validation 
[69]  may  be  used. 

Another  approach  to  coping  with  the  unavailability  of  class  labels  is  to  regard  {yi}™=  i  as 
latent  variables ,  and  apply  the  expectation-maximization  (EM)  algorithm  [41]  for  finding 
a  local  maximizer  of  the  joint  likelihood: 

n 

max  TT pg(xi,  yp  (3,  tv). 

(3,7T  -*■ 

i=l 

A  more  flexible  variant  of  the  EM  algorithm  called  the  split- and-merge  EM  algorithm 
[189]  is  also  available,  which  dynamically  controls  the  number  of  clusters  during  the  EM 
iteration. 

Instead  of  point-estimating  the  parameters  (3  and  tv,  one  can  also  consider  their  dis¬ 
tributions  in  the  Bayesian  framework  [24],  Let  us  introduce  prior  distributions  po((3)  and 
Pe( tv)  for  the  parameters  (3  and  tv.  Then  the  posterior  distribution  of  the  parameters  is 
expressed  as 


Pe{(3,  tv\X)  (xpo(X\(3,Tv)pe((3)po(Tv), 
where  X  =  {a3j}”=1.  Based  on  the  Bayesian  predictive  distribution , 

p(y\x,X)(x  [[ pe(x,y\(3,Tv)pg(f3,Tv\X)df3dTv, 
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class  labels  are  determined  as 


max p(y\x,  X). 
y 

Because  the  integration  included  in  the  Bayesian  predictive  distribution  is  computa¬ 
tionally  expensive,  conjugate  priors  are  often  adopted  in  practice.  For  example,  for  the 
Gaussian-cluster  model  pg(x\y;  (3 ),  the  Gaussian  prior  is  assumed  for  the  mean  parameter 
and  the  Wishart  prior  is  assumed  for  the  precision  parameter  (i.e.,  the  inverse  covariance) 
for  the  multinomial  model  pg(y;  7r),  the  Dirichlct  prior  is  assumed.  Otherwise,  the  poste¬ 
rior  distribution  is  approximated  by  the  Laplace  approximation  [116],  the  Markov  chain 
Monte  Carlo  sampling  [10],  or  the  variational  approximation  [13,  57].  The  number  of 
clusters  can  be  determined  based  on  the  maximization  of  the  marginal  likelihood : 

pg(X)  =  argmax  JJp0(X\f3,  n)pe(f3)pg(n)d(3d7r.  (48) 

The  generative  clustering  methods  are  statistically  well-founded.  However,  density 
models  for  each  cluster  p(x\y)  need  to  be  specified  in  advance,  which  lacks  flexibility 
in  practice.  Furthermore,  in  the  Bayesian  approach,  the  choice  of  cluster  models  and 
prior  distributions  are  often  limited  to  conjugate  pairs  in  practice.  On  the  other  hand, 
in  the  frequentist  approach,  only  local  solutions  can  be  obtained  in  practice  due  to  the 
non-convexity  caused  by  mixture  modeling. 

5.3.6  Posterior-Maximization  Clustering 

Another  possible  clustering  approach  based  on  probabilistic  inference  is  to  directly  max¬ 
imizes  the  posterior  probability  of  class  labels  y  =  {)(/?; }f=i  [24]: 

maxp(y\X). 

Let  us  model  the  cluster-wise  data  distribution  p(X\y)  by  pg(X\y,/3). 

An  approximate  inference  method  called  iterative  conditional  modes  [104]  alternatively 
maximizes  the  posterior  probabilities  of  y  and  / 3  until  convergence: 

y<^Po(y\x,$), 

p<^P0(p\x,y). 

When  the  Gaussian  model  with  covariance  identity  is  assumed  for  pg(y\X,  (3),  this  algo¬ 
rithm  is  reduced  to  the  k-means  algorithm  (see  Section  5.3.1)  under  the  uniform  priors. 

Let  us  consider  the  class-prior  probability  p(y)  and  model  it  by  pg(y\ir).  Introducing 
the  prior  distributions  pg((3 )  and  pg(n),  we  can  approximate  the  posterior  distribution  of 
y  as 

Pe{y\X)  oc  JJ pg(X\y,f3)pg((3)pg(y\7r)pg(Tr)d(3dn. 
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Similarly  to  generative  clustering  described  in  Section  5.3.5,  conjugate  priors  such  as 
the  Gauss-Wishart  prior  and  the  Dirichlet  prior  are  practically  useful  in  improving  the 
computational  efficiency.  The  number  of  clusters  can  also  be  similarly  determined  by 
maximizing  the  marginal  likelihood  (48).  However,  direct  optimization  of  y  is  often 
computationally  intractable  due  to  cn  combinations,  where  c  is  the  number  of  clusters 
and  n  is  the  number  of  samples.  For  this  reason,  efficient  sampling  schemes  such  as  the 
Markov  chain  Monte  Carlo  are  indispensable  in  this  approach. 

A  Dirichlet  process  mixture  [51,  11]  is  a  non-parametric  extension  of  the  above  ap¬ 
proach,  where  an  infinite  number  of  clusters  are  implicitly  considered  and  the  number 
of  clusters  is  automatically  determined  based  on  observed  data.  In  order  to  improve  the 
computational  efficiency  of  this  infinite  mixture  approach,  various  approximation  schemes 
such  as  Markov  chain  Monte  Carlo  sampling  [121]  and  variational  approximation  [25]  have 
been  introduced.  Furthermore,  variants  of  Dirichlet  processes  such  as  hierarchical  Dirich¬ 
let  processes  [181],  nested  Dirichlet  processes  [141],  and  dependent  Dirichlet  processes 
[109]  have  been  developed  recently. 

However,  even  in  this  non-parametric  Bayesian  approach,  density  models  for  each 
cluster  still  need  to  be  parametrically  specified  in  advance,  which  is  often  restricted  to 
Gaussian  models  in  practice.  This  highly  limits  the  flexibility  of  clustering. 

5.3.7  Dependence-Maximization  Clustering 

The  Hilbert- Schmidt  independence  criterion  (HSIC)  [61]  is  a  dependence  measure  based 
on  a  reproducing  kernel  function  K(x,  x')  [12].  [153]  proposed  a  dependence-maximization 
clustering  method  called  clustering  with  HSIC  (CLUHSIC),  which  tries  to  determine  clus¬ 
ter  assignments  so  that  their  dependence  on  feature  vectors  {cCj}”=1  is  maximized. 

More  specifically,  CLUHSIC  tries  to  find  the  cluster  indicator  matrix  II  (see  Sec¬ 
tion  5.3.4)  that  maximizes 


tr(FOIAIIT), 

where  Kt  J  :=  K(xi,xf)  and  A  is  a  c  x  c  cluster-cluster  similarity  matrix.  Note  that 
nAnT  can  be  regarded  as  the  kernel  matrix  for  cluster  assignments.  [153]  used  a  greedy 
algorithm  to  optimize  the  cluster  indicator  matrix,  which  is  computationally  demanding. 
[215]  gave  spectral  and  semi-definite  relaxation  techniques  to  improve  the  computational 
efficiency  of  CLLIHSIC. 

HSIC  is  a  kernel-based  independence  measure  and  the  kernel  function  K(x,x')  needs 
to  be  determined  in  advance.  However,  there  is  no  systematic  model  selection  strategy  for 
HSIC,  and  using  the  Gaussian  kernel  with  width  set  to  the  median  distance  between  sam¬ 
ples  is  a  standard  heuristic  in  practice  [145].  On  the  other  hand,  our  proposed  approach 
is  equipped  with  an  objective  model  selection  strategy,  which  is  a  notable  advantage  over 
CLUHSIC. 

Another  line  of  dependence-maximization  clustering  adopts  mutual  information  (MI) 
as  a  dependency  measure.  Recently,  a  dependence-maximization  clustering  method  called 


71 


mean  nearest-neighbor  (MNN)  clustering  was  proposed  [50].  MNN  is  based  on  the  k- 
nearest-neighbor  entropy  estimator  proposed  by  [102], 

The  performance  of  the  original  fc-nearest- neighbor  entropy  estimator  depends  on  the 
choice  of  the  number  of  nearest  neighbors,  k.  On  the  other  hand,  MNN  avoids  this 
problem  by  introducing  a  heuristic  of  taking  an  average  over  all  possible  k.  The  resulting 
objective  function  is  given  by 

log(||*i  -a^-f  +  e),  (49) 

Tly  1  . 

y= 1  iTj-yi=yj=y 

where  e  (>  0)  is  a  smoothing  parameter.  Then  this  objective  function  is  minimized  with 
respect  to  cluster  assignments  {?/«}”=  i  using  a  greedy  algorithm. 

Although  the  fact  that  the  tuning  parameter  k  is  averaged  out  is  convenient,  this 
heuristic  is  not  well  justified  theoretically.  Moreover,  the  choice  of  the  smoothing  param¬ 
eter  e  is  arbitrary.  In  the  MATLAB  code  provided  by  one  of  the  authors,  e  =  1  jn  was 
recommended,  but  there  seems  no  justification  for  this  choice.  Also,  due  to  the  greedy  op¬ 
timization  scheme,  MNN  is  computationally  expensive.  On  the  other  hand,  our  proposed 
approach  offers  a  well-justified  model  selection  strategy,  and  the  SMI-based  clustering 
gives  an  analytic-form  solution  which  can  be  computed  efficiently. 


5.3.8  Information-Maximization  Clustering  with  Mutual  Information 

Finally,  we  review  methods  of  information-maximization  clustering  based  on  mutual  infor¬ 
mation  [1,  60],  which  belong  to  the  same  family  of  clustering  algorithms  as  our  proposed 
method. 

Mutual  information  (MI)  is  defined  and  expressed  as 


MI  :=  /  '^Tpix/y)  log 
J  y= i 


p(x,y) 

P{x)p(y ) 


dx 


5 ^p(y\x)p(x )  \°gp(y\x)d 

y=l 


X 


^p(y\x)p(x)logp(y)dx.  (50) 
y=  i 


Let  us  approximate  the  class-posterior  probability  p(y\x)  by  a  conditional-probability 
model  p(y\x] «)  with  parameter  a.  Then  the  marginal  probability  p(y)  can  be  approxi¬ 
mated  as 


p(y)  = 


j p(y\x)p(x)dx  « 


1  X  A 

-J2p(y  !**;«)• 

n  “ 


(51) 


By  further  approximating  the  expectation  with  respect  to  p(x)  included  in  Eq.(50)  by  the 
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empirical  average  of  samples  {cCj}”=1,  the  following  MI  estimator  can  be  obtained  [1,  60]: 


MI  :=  -  VVj)(t/ \Xi,  a)  \ogp(y\xp  a) 

n  z— '  z — ' 


*= i  ?/=i 


(  ~^v{y\xi]cx)  1  log  (  - ^2p{y\xj-,a) )  . 


.  n 

y=l  \  i=l 


1=1 


(52) 


In  [1],  the  Gaussian  model, 

p(2/|tc;  «)  oc  exp 


x 


2si 


+ 


(or  its  kernclized  version)  is  adopted,  where  a  =  { cy ,  sy,  by}cy=l  is  the  parameter.  Then  a 

local  maximizer  of  MI  with  respect  to  the  parameter  at  is  found  by  a  gradient  method. 
On  the  other  hand,  in  [60],  the  logistic  model 


p(y\x;a)  oc  exp  (oJa;)  , 


(53) 


(or  its  kernelized  version)  is  adopted,  where  ot  =  {oty}y= l  is  the  parameter.  Then  a  local 
maximizer  of  MI  with  respect  to  the  parameter  cx.  is  found  by  a  quasi-Newton  method. 
Finally,  cluster  assignments  {y,}"=1  are  determined  as 


Hi  =  argmax  p(y\xi,  a), 
y 

where  a  is  a  local  maximizer  of  MI.  Below,  we  refer  to  the  above  method  as  Mi-based 
clustering  (MIC). 

In  the  kernclized  version  of  MIC,  the  user  needs  to  determine  parameters  included 
in  the  kernel  function  such  as  the  kernel  width  or  the  number  of  nearest  neighbors.  [1] 
proposed  to  choose  the  kernel  parameters  so  that  MI  (52)  is  maximized.  Thus,  cluster 
assignments  and  kernel  parameters  can  be  consistently  determined  under  the  common 
guidance  of  maximizing  MI.  However,  since  MI  is  an  unsupervised  estimator  of  MI,  it  is 
not  accurately  enough;  in  the  model  selection  stage,  cluster  labels  {r/i}]l=1  are  available  and 
thus  supervised  estimation  of  MI  is  more  favorable.  Indeed,  there  exists  a  more  powerful 
supervised  MI  estimator  called  maximum-likelihood  MI  (MLMI)  [180],  which  was  proved 
to  achieve  the  optimal  non-parametric  convergence  rate. 

The  derivation  of  MLMI  follows  a  similar  line  to  LSMI  explained  in  Section  5.2.4,  i.e., 
the  density-ratio  function  (36)  is  learned.  More  specifically,  the  following  density-ratio 
model  rg(x,y-,0)  is  used: 


re(x,y,d)  \=  ^  0£L(x,xe), 
tw=v 


where  6  =  (0\, . . .  ,6n)T  and  L(x,  x')  is  a  kernel  function  with  a  kernel  parameter  7. 
Then  the  parameter  6  is  learned  so  that  the  Kullback-Leiblcr  divergence  from  p(x,y) 


73 


to  rg(x,  y\  G)p(x)p(y)  is  minimized15.  An  empirical  version  of  the  MLMI  optimization 
problem  is  given  as 

1  n 

max  -  V  log  re  (x^y^G) 

0  n  z ' 

i= 1 

l  n 

s.t.  —  ro{xl,yf  G)  =  1  and  G  >  0„, 

rr  z ' 

i,j= i 

where  the  inequality  for  vectors  is  applied  in  the  element-wise  manner.  This  is  a  convex 
optimization  problem,  and  thus  the  global  optimal  solution  G,  which  tends  to  be  sparse, 
can  be  easily  obtained  by,  e.g.,  iteratively  performing  gradient  ascent  and  projection  [173]. 
Then  an  MI  estimator  called  MLMI  is  given  as  follows: 

1  n 

MLMI  :=  -  Vlog rB{Xi,yi,Q). 

n  z — J 

i=  1 

The  kernel  parameter  7  included  in  the  kernel  function  L(x,x')  can  be  optimized  by 
cross-validation,  in  the  same  way  as  LSMI  [180]. 

5.4  Experiments 

In  this  section,  we  experimentally  evaluate  the  performance  of  the  proposed  and  existing 
clustering  methods. 

5.4.1  Illustration 

First,  we  illustrate  the  behavior  of  the  proposed  method  using  the  following  4  artificial 
datasets  with  dimensionality  d  =  2  and  sample  size  n  =  200: 

(a)  Four  Gaussian  blobs:  For  the  number  of  classes  c  =  4,  samples  in  each  class  are 

drawn  from  the  Gaussian  distributions  with  mean  (2,  2)T ,  ( — 2, 2)T,  (2,  — 2)T,  and 
(—2,  — 2)T  and  covariance  matrix  0.25/2,  respectively. 

(b)  Circle  &  Gaussian:  For  c  =  2,  samples  in  one  class  are  drawn  from  the  2- 

dimensional  standard  normal  distribution,  and  samples  in  the  other  class  are  equi- 
distantly  located  on  the  origin- centered  circle  with  radius  5.  Then  noise  following 
the  origin- centered  normal  distribution  with  covariance  matrix  O.OIJ2  is  added  to 
each  sample. 

(c)  Double  spirals:  For  c  =  2,  the  i-th  sample  in  one  class  is  given  by 

cos(mj),  ii  sin(mj))  ,  and  the  i-th  sample  in  the  other  class  is  given  by 
(— £iCos(mi),  —  li  sin(mi))T,  where  £,  =  1  +  4(i  —  l)/n  and  m;  =  37r(i  —  l)/n.  Then 
noise  following  the  origin-centered  normal  distribution  with  covariance  matrix  O.OI/2 
is  added  to  each  sample. 

15Note  that  rg(x ,  y,  0)p(x)p(y)  can  be  regarded  as  a  model  of  p(x ,  y). 
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(d)  High  Sc  low  densities:  For  c  =  2,  samples  in  one  class  are  drawn  from  the  2- 
dimensional  standard  normal  distribution,  and  samples  in  the  other  class  are  drawn 
from  the  2-dimensional  origin-centered  normal  distribution  with  covariance  matrix 
0.01/2. 

The  class-prior  probability  was  set  to  be  uniform.  The  generated  samples  were  central¬ 
ized  and  their  variance  was  normalized  in  the  dimension-wise  manner  (see  the  top  row  of 
Figure  26).  As  a  kernel  function,  we  used  the  sparse  local-scaling  kernel  (34)  for  SMIC, 
where  the  kernel  parameter  t  was  chosen  from  {1, . . . ,  10}  based  on  LSMI  with  the  Gaus¬ 
sian  kernel  (38). 

The  top  graphs  in  Figure  26  depict  the  cluster  assignments  obtained  by  SMIC  with 
the  uniform  class-prior,  and  the  bottom  graphs  in  Figure  26  depict  the  model  selection 
curves  obtained  by  LSMI  (i.e.,  the  values  of  LSMI  as  functions  of  the  model  parameter  t). 
The  clustering  performance  was  evaluated  by  the  adjusted  Rand  index  (ARI)  [74]  between 
inferred  cluster  assignments  and  the  ground  truth  categories  (see  Appendix  5.8  for  the 
details  of  ARI).  Larger  ARI  values  mean  better  performance,  and  ARI  takes  its  maximum 
value  1  when  two  sets  of  cluster  assignments  are  identical.  The  results  show  that  SMIC 
combined  with  LSMI  works  well  for  these  toy  datasets. 

Figure  27  depicts  the  cluster  assignments  and  model  selection  curves  obtained  by  MIC 
with  MLMI  (see  Section  5.3.8),  where  pre-training  of  the  kernel  logistic  model  using  the 
cluster  assignments  obtained  by  self-tuning  spectral  clustering  [216]  was  carried  out  for 
initializing  MIC  [60].  The  figure  shows  that  qualitatively  good  clustering  results  were 
obtained  for  the  datasets  (a)  and  (b).  However,  for  the  datasets  (c)  and  (d),  poor  results 
were  obtained  due  to  local  optima  of  the  objective  function  (52). 

Figure  28  and  Figure  29  depict  class-posterior  probabilities  estimated  by  SMIC  and 
MIC,  respectively.  The  plots  show  that,  for  the  datasets  (a),  (b),  and  (c)  where  the  clus¬ 
ters  are  clearly  separated,  the  estimated  class-posterior  probabilities  are  almost  zero-one 
functions  and  thus  the  class  prediction  is  highly  certain.  On  the  other  hand,  for  the  dataset 
(d)  where  the  two  clusters  are  overlapped,  the  estimated  class-posterior  probabilities  tend 
to  take  intermediate  class-posterior  probabilities. 


5.4.2  Influence  of  Imbalanced  Class-Prior  Probabilities 


Next,  we  experimentally  investigate  how  imbalanced  class-prior  probabilities  (i.e.,  the 
sample  size  in  each  cluster  is  significantly  different)  influence  the  clustering  performance 
of  SMIC. 

We  continue  using  the  4  artificial  datasets  used  in  Section  5.4.1,  but  we  set  the  true 
class-prior  probability  as 


p(y  = !) 
p(y  =  3) 


p(y  =  2)  =  0.1,0.15,0.2,0.25, 
p(y  =  4)  =  1  ~  pfo  =  1)  -  p(y  =  2) 
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ARI  =  1 


ARI  =  1 


ARI  =  1 


ARI  =  0.773 


Figure  26:  Illustrative  examples.  Cluster  assignments  obtained  by  SMIC  (top)  and  model 
selection  curves  obtained  by  LSMI  (bottom). 
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(a)  Four  Gaussian  blobs  (b)  Circle  &  Gaussian  (c)  Double  spirals  (d)  High  &  low  densities 

Figure  27:  Illustrative  examples.  Cluster  assignments  obtained  by  MIC  (top)  and  model 
selection  curves  obtained  by  MLMI  (bottom). 


76 


100 

Sample  index 


(a)  Four  Gaussian  blobs 


(b)  Circle  &  Gaussian 


100 

Sample  index 


(c)  Double  spirals  (d)  High  &  low  densities 

Figure  28:  Illustrative  examples.  Class-posterior  probabilities  estimated  by  SMIC. 


(a)  Four  Gaussian  blobs 


(b)  Circle  &  Gaussian 


(c)  Double  spirals 


(d)  High  &  low  densities 


Figure  29:  Illustrative  examples.  Class-posterior  probabilities  estimated  by  MIC. 
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(a)  Four  Gaussian  blobs 


(b)  Circle  and  Gaussian 


(c)  Double  spirals  (d)  High  and  low  densities 


Figure  30:  Illustrative  examples.  The  mean  ARI  over  100  runs  as  functions  of  the  class- 
prior  probability  p(y  =  1).  The  two  methods  were  judged  to  be  comparable  in  terms  of 
the  average  ARI  by  the  t-test  at  the  significance  level  1%. 


for  the  dataset  (a),  and 


p{y  =  1)  =  0.2,  0.3,  0.4,  0.5, 

P(y  =  2)  =  1  -p(y  =  1), 

for  the  datasets  (b)-(d).  The  following  2  approaches  are  compared: 

SMIC:  SMIC  with  the  uniform  class-prior  probabilities  7Ti  =  7r2  =  1/2. 

SMIC*:  SMIC  with  the  true  class-prior  probabilities  7Ti  =  p(y  =  1)  and  7T2  =  p(y  =  2). 

The  mean  and  standard  deviation  of  ARI  over  100  runs  are  plotted  in  Figure  30, 
showing  that  the  difference  between  SMIC  and  SMIC*  is  negligibly  small.  Indeed,  the 
two  methods  were  judged  to  be  comparable  to  each  other  in  terms  of  the  average  ARI  by 
the  t-test  at  the  significance  level  1%  for  all  tested  cases.  This  would  be  a  natural  result 
in  clustering  because  class-prior  probabilities  only  mildly  affect  cluster  boundaries  and 
such  mild  change  in  cluster  boundaries  do  not  significantly  affect  clustering  solutions. 

The  above  results  imply  that  SMIC  is  not  sensitive  to  the  choice  of  class-prior  prob¬ 
abilities.  Thus,  in  practice,  SMIC  with  the  uniform  class-prior  distribution  may  be  used 
when  the  true  class-prior  is  unknown. 
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5.4.3  Performance  Comparison 


Finally,  we  systematically  compare  the  performance  of  the  proposed  and  existing  clus¬ 
tering  methods  using  various  real-world  datasets  such  as  images,  natural  languages,  ac- 
celerometric  sensors,  and  speeches. 

Setup  We  compared  the  performance  of  the  following  methods,  all  of  which  do  not 
contain  open  tuning  parameters  and  therefore  experimental  results  are  fair  and  objective: 

KM:  K-means  [117]  (see  also  Section  5.3.1).  We  used  the  software  included  in  the  MAT- 
LAB  Statistics  Toolbox,  where  initial  values  were  randomly  generated  100  times 
and  the  best  result  in  terms  of  the  k-means  objective  value  was  chosen  as  the  final 
solution. 

SCI:  Spectral  clustering  [149,  123]  (see  also  Section  5.3.2)  with  the  Gaussian  similarity. 
The  Gaussian  width  is  set  to  the  median  distance  between  all  samples,  which  is  a 
popular  heuristic  in  kernel  methods  [145].  We  used  the  publicly  available  MATLAB 
code16,  where  the  post  k-means  processing  was  repeated  10  times  with  heuristic 
initialization:  The  first  center  was  chosen  randomly  from  samples,  and  then  the 
next  center  was  iteratively  set  to  the  farthest  sample  from  the  previous  ones.  The 
best  result  in  terms  of  the  k-means  objective  value  over  10  repetitions  was  chosen 
as  the  final  solution. 

SC2:  Spectral  clustering  with  the  self-tuning  local-scaling  similarity  [216],  instead  of  the 
Gaussian  similarity. 

MNN:  Mean  nearest-neighbor  clustering  [50]  (see  also  Section  5.3.7).  We  used  the  MAT- 
LAB  code  provided  by  one  of  the  authors1 7 .  Following  the  suggestions  provided  in 
the  program  code,  the  number  of  iterations  was  set  to  10  and  the  smoothing  pa¬ 
rameter  e  (see  Eq.(49))  was  set  to  e  =  lfn. 

MIC:  Mi-based  clustering  with  kernel  logistic  models  and  the  sparse  local-scaling  kernel 
[60]  (see  also  Section  5.3.8),  where  model  selection  is  carried  out  by  maximum- 
likelihood  MI  (MLMI)  [180].  We  implemented  this  method  using  MATLAB,  which 
is  a  combination  of  the  MIC  code  personally  provided  by  one  of  the  authors,  and  the 
MLMI  code  available  from  the  web  page  of  one  of  the  authors18.  Following  the  sug¬ 
gestion  provided  in  the  original  program  code,  MIC  was  initialized  by  pre-training 
of  the  kernel  logistic  model  using  the  cluster  assignments  obtained  by  spectral  clus¬ 
tering.  The  tuning  parameter  t  included  in  the  sparse  local-scaling  kernel  (34)  was 
chosen  from  {1, . . . ,  10}  based  on  MLMI  with  Gaussian  kernels  (see  Section  5.3.8). 
The  Gaussian  kernel  width  in  MLMI  was  chosen  from  (10~2, 10-1"5, 10”1, . . . ,  102} 


16http : //web ee . technion. ac . il/~lihi/Demos/SelfTuningClustering .html 
1 ' http : / / www . levf aivishevsky . webs . com/NIC . rar 

18http : / / sugiyama-www. cs . titech. ac . jp/~sugi/ software/MLMI/ index.html 
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based  on  cross-validation.  As  suggested  in  the  MLMI  code  provided  by  the  au¬ 
thor,  the  number  of  kernel  bases  in  MLMI  was  limited  to  200,  which  were  randomly 
chosen  from  all  n  kernels. 

SMIC:  SMI-based  clustering  with  the  sparse  local-scaling  kernel  and  the  uniform  class- 
prior  distribution  (see  Section  5.2.3),  where  model  selection  is  carried  out  by  least- 
squares  MI  (LSMI)  [179]  (see  also  Section  5.2.4).  We  implemented  SMIC  and 
LSMI  using  MATLAB  by  ourselves.  The  tuning  parameter  t  included  in  the 
sparse  local-scaling  kernel  (34)  was  chosen  from  {1, . . . ,  10}  based  on  LSMI  with 
Gaussian  kernels  (see  Section  5.2.4).  The  Gaussian  kernel  width  and  regulariza¬ 
tion  parameter  included  in  LSMI  were  chosen  from  {10“2, 10-1'5, 10-1, . . . ,  102}  and 
(10“3, 10~25, 10~2, . . . ,  101},  respectively,  based  on  cross-validation.  Similarly  to 
MLMI,  the  number  of  kernel  bases  in  LSMI  was  limited  to  200,  which  were  ran¬ 
domly  chosen  from  all  n  kernels. 

In  addition  to  the  clustering  quality  in  terms  of  ARI,  we  also  evaluated  the  computa¬ 
tional  efficiency  of  each  method  by  the  CPLT  computation  time. 

Datasets  We  used  the  following  6  real-world  datasets. 

Digit  (d  =  256,  n  =  5000,  and  c  =  10):  The  USPS  hand-written  digit  dataset19,  which 
contains  9298  digit  images.  Each  image  consists  of  256  (=  16  x  16)  pixels  and  rep¬ 
resents  a  digit  in  (0, 1,  2, ... ,  9}.  Each  pixel  takes  a  value  in  [—1,  +1]  corresponding 
to  the  intensity  level  in  gray-scale.  We  randomly  chose  500  samples  from  each  of 
the  10  classes,  and  used  5000  samples  in  total. 

Face  ( d  =  4096, n  =  100,  and  c  =  10):  The  Olivetti  Face  dataset20,  which  contains  400 
gray-scale  face  images  (40  people;  10  images  per  person).  Each  image  consists  of 
4096  (=  64  x  64)  pixels  and  each  pixel  takes  an  integer  value  between  0  and  255  as 
the  intensity  level.  We  randomly  chose  10  people,  and  used  100  samples  in  total. 

Document  ( d  =  50,  n  =  700,  and  c  =  7):  The  20-Newsgroups  dataset21,  which  contains 
20000  newsgroup  documents  across  20  different  newsgroups.  We  merged  the  20 
newsgroups  into  the  following  7  top-level  categories:  “comp”,  “rec”,  usci” ,  “talk” , 
“ alt ”,  “misc” ,  and  “soc” .  Each  document  is  expressed  by  a  10000-dimensional 
bag-of-words  vector  of  term- frequencies.  Following  the  convention  [78],  we  trans¬ 
formed  the  term-frequency  vectors  to  the  term  frequency /inverse  document  fre¬ 
quency  (TFIDF)  vector,  i.e.,  we  multiplied  the  term- frequency  by  the  logarithm 
of  the  inverse  ratio  of  the  documents  containing  the  corresponding  word.  We  ran¬ 
domly  chose  100  samples  from  each  of  the  7  classes,  and  used  700  samples  in  total. 
We  applied  principal  component  analysis  (PGA)  [130,  79]  to  the  700  samples,  and 
extracted  50-dimensional  feature  vectors. 

19http : //www. gauss ianprocess . org/gpml/data/ 

20http : / / www . cs . toronto . edu/~roweis/ data . html 
21http : //people . csail .mit . edu/jrennie/20Newsgroups/ 
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Word  (d  =  50,  n  =  300,  and  c  =  3):  The  SENSEVAL-2  dataset22  for  word-sense  disam¬ 
biguation.  We  took  the  noun  “ interest ”  appeared  in  1930  contexts,  having  3  different 
meanings:  “advantage,  advancement  or  favor” ,  “a  share  in  a  company  or  business” , 
and  “money  paid  for  the  use  of  money”  (i.e.,  3  classes).  From  each  surrounding  con¬ 
text,  we  extracted  a  14936-dimensional  feature  vector  [127],  which  includes  three 
types  of  features:  part- of- speech  of  neighboring  words  with  position  information, 
bag-of-words  in  the  surrounding  context,  and  local  collocation  [106].  We  randomly 
chose  100  samples  from  each  of  the  3  classes,  and  used  300  samples  in  total.  We 
applied  PCA  to  the  300  samples,  and  extracted  50-dimensional  feature  vectors. 

Accelerometry  (d  =  5 ,n  —  300,  and  c  =  3):  The  ALKAN  dataset23,  which  contains  3- 
axis  (i.e.,  x-,  y-,  and  z-axes)  accelerometric  data  collected  by  the  iPod  touch.  In 
the  data  collection  procedure,  subjects  were  asked  to  perform  three  specific  tasks: 
walking,  running,  and  standing  up.  The  duration  of  each  task  was  arbitrary,  and  the 
sampling  rate  was  20Hz  with  small  variations.  Each  data-stream  was  then  segmented 
in  a  sliding  window  manner  with  window  width  5  seconds  and  sliding  step  1  second 
[66].  Depending  on  subjects,  the  position  and  orientation  of  the  accelerometer  was 
arbitrary — held  by  hand  or  kept  in  a  pocket  or  a  bag.  For  this  reason,  we  took  the 
^2-norm  of  the  3-dimensional  acceleration  vector  at  each  time  step,  and  computed 
the  following  5  orientation-invariant  features  from  each  window:  mean,  standard 
deviation,  fluctuation  of  amplitude,  average  energy,  and  frequency- domain  entropy 
[17,  18].  We  randomly  chose  100  samples  from  each  of  the  3  classes,  and  used  300 
samples  in  total. 

Speech  (d  =  50,  n  =  400,  and  c  =  2):  An  in-house  speech  dataset,  which  contains  short 
utterance  samples  recorded  from  2  male  subjects  speaking  in  French  with  sampling 
rate  44.1kHz.  From  each  utterance  sample,  we  extracted  a  50-dimensional  line 
spectral  frequencies  vector  [80].  We  randomly  chose  200  samples  from  each  class, 
and  used  400  samples  in  total. 

For  each  dataset,  the  experiment  was  repeated  100  times  with  random  choice  of  sam¬ 
ples  from  the  database,  where  the  cluster  size  is  balanced.  Samples  were  centralized 
and  their  variance  was  normalized  in  the  dimension-wise  manner,  before  feeding  them  to 
clustering  algorithms. 

Results  The  experimental  results  are  described  in  Table  1.  For  the  digit  dataset,  MIC 
and  SMIC  outperform  KM,  SCI,  SC2,  and  MNN  in  terms  of  ARI.  The  entire  computation 
time  of  SMIC  including  model  selection  is  faster  than  the  other  methods.  For  the  face 
dataset,  SC2,  MIC,  and  SMIC  are  comparable  to  each  other  and  are  better  than  KM, 
SCI,  and  MNN  in  terms  of  ARI.  For  the  document  and  word  datasets,  SMIC  tends  to 
outperform  the  other  methods.  For  the  accelerometry  dataset,  MNN  performs  the  best 
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and  SMIC  follows.  Finally,  for  the  speech  dataset,  MIC  and  SMIC  work  comparably  well, 
and  are  significantly  better  than  the  other  methods. 

The  above  results  showed  that  MIC  worked  reasonably  well,  implying  that  the  MLMI- 
based  model  selection  strategy  is  practically  useful.  However,  SMIC  was  shown  to  work 
even  better  than  MIC,  with  much  less  computation  time.  The  accuracy  improvement  of 
SMIC  over  MIC  was  gained  by  computing  the  SMIC  solution  in  a  closed-form  without 
any  heuristic  initialization.  The  computational  efficiency  of  SMIC  was  brought  by  the 
analytic  computation  of  the  optimal  solution  and  the  class-wise  optimization  of  LSMI 
(see  Section  5.2.4). 

The  performance  of  MNN  and  SC2  was  rather  unstable  because  of  the  heuristic  aver¬ 
aging  of  the  number  of  nearest  neighbors  in  MNN  and  the  heuristic  choice  of  local  scaling 
in  SC.  In  terms  of  computation  time,  they  are  relatively  efficient  for  small-  to  medium¬ 
sized  datasets,  but  they  are  expensive  for  the  largest  dataset,  digit.  SCI  did  not  perform 
as  well  as  SC2,  except  for  the  digit  dataset.  KM  was  not  reliable  for  the  document  and 
speech  datasets  because  of  the  restriction  that  the  cluster  boundaries  are  linear.  For  the 
digit ,  face,  and  document  datasets,  KM  was  computationally  very  expensive  since  a  large 
number  of  iterations  were  needed  until  convergence  to  a  local  optimum  solution. 

We  also  performed  similar  experiments  with  smaller  numbers  of  samples.  Table  2 
describes  the  results,  showing  that  the  tendency  of  the  experimental  does  not  change 
significantly  and  the  proposed  SMIC  still  performs  well. 

Finally,  we  considered  the  imbalanced  setup  where  the  sample  size  of  the  first  class 
was  set  to  be  m  times  larger  than  other  classes  with  the  total  number  of  samples  fixed  to 
the  same  number.  The  results  are  summarized  in  Table  3,  showing  that  the  performance 
of  all  methods  tends  to  be  degraded  as  the  degree  of  cluster  imbalance  increases.  This 
implies  that  clustering  becomes  more  challenging  if  the  cluster  size  is  imbalanced.  Among 
the  compared  methods,  SMIC  (with  the  uniform  prior)  still  worked  better  than  other 
methods. 

Overall,  the  proposed  SMIC  combined  with  LSMI  was  shown  to  be  a  practically  useful 
alternative  to  existing  clustering  approaches. 

5.5  Conclusions 

In  this  paper,  we  proposed  a  novel  information-maximization  clustering  method  that 
learns  class-posterior  probabilities  in  an  unsupervised  manner  so  that  the  squared-loss 
mutual  information  (SMI)  between  feature  vectors  and  cluster  assignments  is  maximized. 
The  proposed  algorithm,  called  SMI-based  clustering  (SMIC),  allows  us  to  obtain  clus¬ 
tering  solutions  analytically  by  solving  a  kernel  eigenvalue  problem.  Thus,  unlike  the 
previous  information- maximization  clustering  methods  [1,  60],  SMIC  does  not  suffer  from 
the  problem  of  local  optima.  Furthermore,  we  proposed  to  use  an  optimal  non-parametric 
SMI  estimator  called  least-squares  mutual  information  (LSMI)  for  data-driven  parameter 
optimization.  Through  experiments,  SMIC  combined  with  LSMI  was  demonstrated  to 
compare  favorably  with  existing  clustering  methods. 

In  experiments,  the  proposed  clustering  method  was  shown  to  be  useful  for  various 
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Tabic  1:  Experimental  results  on  real-world  datasets  (with  equal  cluster  size).  The  average 
clustering  accuracy  (and  its  standard  deviation  in  the  bracket)  in  terms  of  ARI  and  the 
average  CPU  computation  time  in  second  over  100  runs  are  described.  Larger  ARI  is 
better,  and  shorter  computation  time  is  preferable.  The  best  method  in  terms  of  the 
average  ARI  and  methods  judged  to  be  comparable  to  the  best  one  by  the  t-test  at  the 
significance  level  1%  are  described  in  boldface.  Computation  time  of  MIC  and  SMIC 
corresponds  to  the  time  for  computing  a  clustering  solution  after  model  selection  has 
been  carried  out.  For  references,  computation  time  for  the  entire  procedure  including 
model  selection  is  described  in  the  square  bracket,  which  depends  on  the  number  of 
model  candidates  (in  the  current  setup,  we  had  81  (=  9  x  9)  candidates). 
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Digit  ( d 
SCI 

=  256,  n  = 
SC2 

5000,  and  c  = 
MNN 

10) 

MIC 

SMIC 

ARI 

Time 

0.42(0.01) 

1414.6 

0.46(0.01) 

561.3 

0.24(0.02) 

495.1 

0.44(0.03) 

228.4 

0.63(0.08) 
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KM 

Face  ( d 
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SC2 
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MNN 
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KM 
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SMIC 
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Time 

0.00(0.00) 
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0.00(0.00) 
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SMIC 
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0.01(0.02) 

1.7 
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0.02(0.02) 

1.7 

0.04(0.04) 

1.4[85.6] 
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KM 

Accelerometry  (d  =  5, 
SCI  SC2 

n  =  300,  and 
MNN 

c  =  3) 

MIC 

SMIC 

ARI 

Time 

0.50(0.03) 

0.2 

0.20(0.26) 

1.7 

0.60(0.16) 

1.7 

0.73(0.05) 

1.8 

0.61(0.24) 

1.3[137.2] 

0.68(0.12) 
0.6  [36.4] 

KM 

Speech  (d  =  50,  n  - 
SCI  SC2 

=  400,  and  c  = 
MNN 

:  2) 

MIC 

SMIC 

ARI 

Time 

0.00(0.00) 

0.4 

0.00(0.00) 

2.1 

0.00(0.00) 

1.9 

0.04(0.15) 

1.8 

0.18(0.16) 

1.3[134.3] 

0.21(0.25) 

0.5[43.0] 
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Table  2:  Experimental  results  on  real-world  datasets  for  different  numbers  of  samples. 
ARI  values  are  described  in  the  table.  The  results  for  n  are  the  same  as  the  ones  reported 
in  Table  1. 


Digit  (d  = 

256,  n  =  5000,  and  c  =  10) 

ARI 

KM 

SCI 

SC2 

MNN 

MIC 

SMIC 

n 

0.42(0.01) 

0.46(0.01) 

0.24(0.02) 

0.44(0.03) 

0.63(0.08) 

0.63(0.05) 

n*  3/4 

0.43(0.01) 

0.47(0.01) 

0.25(0.02) 

0.45(0.03) 

0.64(0.09) 

0.65(0.05) 

n  *  1/2 

0.43(0.02) 

0.47(0.01) 

0.26(0.02) 

0.44(0.04) 

0.61(0.12) 

0.64(0.05) 

n*  1/4 

0.41(0.02) 

0.45(0.02) 

0.28(0.03) 

0.43(0.04) 

0.60(0.10) 

0.59(0.06) 

Face  (d  = 

4096,  n  =  100,  and  c  =  10) 

ARI 

KM 

SCI 

SC2 

MNN 

MIC 

SMIC 

n 

0.60(0.11) 

0.37(0.08) 

0.62(0.11) 

0.47(0.10) 

0.64(0.12) 

0.65(0.12) 

n*  3/4 

0.59(0.12) 

0.29(0.07) 

0.53(0.12) 

0.41(0.11) 

0.62(0.12) 

0.64(0.12) 

n  *  1/2 

0.60(0.14) 

0.17(0.08) 

0.36(0.12) 

0.26(0.11) 

0.55(0.12) 

0.57(0.13) 

Document  1 

[d  =  50,  n  = 

700,  and  c  =  7) 

ARI 

KM 

SCI 

SC2 

MNN 

MIC 

SMIC 

n 

0.00(0.00) 

0.00(0.00) 

0.09(0.02) 

0.09(0.02) 

0.01(0.02) 

0.19(0.03) 

n  *  3/4 

0.00(0.00) 

0.01(0.03) 

0.10(0.02) 

0.09(0.02) 

0.01(0.02) 

0.20(0.03) 

n  *  1/2 

0.00(0.00) 

0.04(0.05) 

0.10(0.02) 

0.09(0.02) 

0.02(0.03) 

0.19(0.03) 

n  *  1/4 

0.00(0.00) 

0.10(0.05) 

0.11(0.03) 

0.10(0.03) 

0.03(0.04) 

0.19(0.05) 

Word  (d 

=  50,  n  =  300,  and  c  =  3) 

ARI 

KM 

SCI 

SC2 

MNN 

MIC 

SMIC 

n 

0.04(0.05) 

0.01(0.02) 

0.02(0.01) 

0.02(0.02) 

0.04(0.04) 

0.08(0.05) 

n  *  3/4 

0.02(0.03) 

0.00(0.01) 

0.02(0.02) 

0.02(0.02) 

0.04(0.04) 

0.07(0.05) 

n  *  1/2 

0.02(0.02) 

0.00(0.00) 

0.02(0.03) 

0.02(0.02) 

0.03(0.03) 

0.07(0.05) 

n*  1/4 

0.02(0.04) 

-0.00(0.02) 

0.02(0.03) 

0.02(0.03) 

0.04(0.06) 

0.05(0.05) 

Accelerometry  (d  =  5,  n 

=  300,  and  c  = 

:  3) 

ARI 

KM 

SCI 

SC2 

MNN 

MIC 

SMIC 

n 

0.50(0.03) 

0.20(0.26) 

0.60(0.16) 

0.73(0.05) 

0.61(0.24) 

0.68(0.12) 

n*  3/4 

0.50(0.05) 

0.25(0.29) 

0.64(0.18) 

0.72(0.08) 

0.60(0.25) 

0.69(0.12) 

n  *  1/2 

0.51(0.09) 

0.33(0.30) 

0.65(0.18) 

0.71(0.09) 

0.62(0.24) 

0.72(0.13) 

n*  1/4 

0.54(0.14) 

0.56(0.21) 

0.65(0.18) 

0.66(0.14) 

0.58(0.23) 

0.71(0.14) 

Speech  (d 

=  50,  n  =  400,  and  c  =  2) 

ARI 

KM 

SCI 

SC2 

MNN 

MIC 

SMIC 

n 

0.00(0.00) 

0.00(0.00) 

0.00(0.00) 

0.04(0.15) 

0.18(0.16) 

0.21(0.25) 

n*  3/4 

0.00(0.01) 

0.00(0.01) 

0.00(0.01) 

0.01(0.09) 

0.17(0.14) 

0.24(0.26) 

n  *  1/2 

0.00(0.01) 

0.01(0.01) 

0.00(0.01) 

0.01(0.05) 

0.13(0.11) 

0.17(0.22) 

n*  1/4 

0.01(0.03) 

0.01(0.02) 

0.00(0.02) 

0.02(0.07) 

0.12(0.12) 

0.09(0.18) 

84 


Table  3:  Experimental  results  on  real-world  datasets  under  imbalanced  setup.  ARI  values 
are  described  in  the  table.  Class-imbalance  was  realized  by  setting  the  sample  size  of  the 
first  class  m  times  larger  than  other  classes.  SMIC  was  computed  with  the  uniform  prior 
(i.e.,  the  non-informative  prior).  The  results  for  rn  —  1  are  the  same  as  the  ones  reported 
in  Table  1. 


Digit  (d  =  256,  n  =  5000,  and  c  =  10) 


KM 

SC 

MNN 

MIC 

SMIC 

m  =  1 

0.42(0.01) 

0.24(0.02) 

0.44(0.03) 

0.63(0.08) 

0.63(0.05) 

m  =  2 

0.52(0.01) 

0.21(0.02) 

0.43(0.04) 

0.60(0.05) 

0.63(0.05) 

Document  ( d  = 

50,  n  =  700,  i 

md  c  =  7) 

KM 

SC 

MNN 

MIC 

SMIC 

rn  —  1 

0.00(0.00) 

0.09(0.02) 

0.09(0.02) 

0.01(0.02) 

0.19(0.03) 

rn  —  2 

0.01(0.01) 

0.10(0.03) 

0.10(0.02) 

0.01(0.02) 

0.19(0.04) 

rn  —  3 

0.01(0.01) 

0.10(0.03) 

0.09(0.02) 

-0.01(0.03) 

0.16(0.05) 

rn  —  4 

0.02(0.01) 

0.09(0.03) 

0.08(0.02) 

-0.00(0.04) 

0.14(0.05) 

Word  ( d  =  50,  n  =  300,  and  c  =  3) 

KM 

SC 

MNN 

MIC 

SMIC 

rn  —  1 

0.04(0.05) 

0.02(0.01) 

0.02(0.02) 

0.04(0.04) 

0.08(0.05) 

rn  —  2 

0.00(0.07) 

-0.01(0.01) 

0.01(0.02) 

-0.02(0.05) 

0.03(0.05) 

Accelerometry  ( d 

=  5,  n  —  300, 

and  c  =  3) 

KM 

SC 

MNN 

MIC 

SMIC 

rn  —  1 

0.49(0.04) 

0.58(0.14) 

0.71(0.05) 

0.57(0.23) 

0.68(0.12) 

rn  —  2 

0.48(0.05) 

0.54(0.14) 

0.58(0.11) 

0.49(0.19) 

0.69(0.16) 

rn  —  3 

0.49(0.05) 

0.47(0.10) 

0.42(0.12) 

0.42(0.14) 

0.66(0.20) 

rn  —  4 

0.49(0.06) 

0.38(0.11) 

0.31(0.09) 

0.40(0.18) 

0.56(0.22) 

types  of  data.  However,  the  amount  of  improvement  is  large  for  some  datasets,  while  it  is 
mild  for  other  datasets.  It  is  thus  practically  important  to  gain  more  insights  on  in  what 
case  the  proposed  method  is  advantageous.  Also,  theoretically  elucidating  statistical  con¬ 
sistency  of  the  proposed  method  as  well  as  investigating  the  perturbation  stability  in  more 
details  is  also  an  important  challenge.  We  will  also  analyze  properties  of  other  popular 
clustering  algorithms  within  the  framework  of  information-maximization  clustering. 

The  sparse  local-scaling  kernel  (34)  was  shown  to  be  useful  in  experiments.  Since  this 
produces  a  sparse  kernel  matrix,  the  computation  of  SMIC  (i.e.,  solving  a  kernel  eigenvalue 
problem)  can  be  carried  out  very  efficiently.  However,  if  model  selection  is  taken  into 
account,  the  proposed  clustering  procedure  is  still  computationally  rather  demanding 
due  to  the  repeated  computation  of  LSMI,  which  requires  to  solve  a  system  of  linear 
equations.  In  the  experiments,  we  used  the  Gaussian  kernel  (38)  for  LSMI  and  found  it 
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useful  in  practice.  However,  it  produces  a  dense  kernel  matrix  and  thus  a  dense  system 
of  linear  equations  need  to  be  solved,  which  is  computationally  expensive.  If  a  sparse 
kernel  is  used  also  for  LSMI,  its  computational  efficiency  will  be  highly  improved.  In  our 
preliminary  experiments,  the  use  of  the  sparse  local-scaling  kernel  for  LSMI  improved  the 
computational  efficiency,  but  it  did  not  perform  as  well  as  the  Gaussian  kernel.  Thus,  our 
important  future  work  is  to  find  a  sparse  kernel  that  gives  an  accurate  approximation  of 
SMI  with  high  computational  efficiency. 

As  addressed  in  [153],  kernelized  methods  can  be  applied  to  clustering  of  non-vectorial 
structured  objects  such  as  strings ,  trees ,  and  graphs  by  employing  kernel  functions  defined 
for  such  structured  data  [115,  47,  90, 101,  91,  55,  54],  Since  these  structured  kernels  usually 
contain  tuning  parameters,  the  performance  of  clustering  methods  without  systematic 
model  selection  strategies  depends  on  subjective  parameter  tuning,  which  is  not  preferable 
in  practice.  For  Gaussian  kernels,  there  exists  a  popular  heuristic  that  the  Gaussian 
width  is  set  to  the  median  distance  between  samples  [145] .  However,  there  seems  no  such 
common  heuristic  for  structured  kernels.  In  such  scenarios,  the  proposed  method  will  be 
highly  advantageous  because  it  allows  systematic  model  selection  for  any  kernels.  We  will 
explore  this  direction  in  our  future  work. 

We  experimentally  showed  that  the  proposed  method  with  the  uniform  class-prior 
distribution  still  works  well  even  when  the  true  class-prior  probability  is  not  uniform.  This 
is  a  useful  property  in  practice  since  the  true  class-prior  probability  is  often  unknown. 
Another  way  to  address  this  issue  is  to  estimate  the  true  class-prior  probability  in  a  data- 
driven  fashion,  for  example,  iteratively  performing  clustering  and  updating  the  class-prior 
probabilities.  We  will  investigate  such  an  adaptive  approach  in  our  future  work. 

The  proposed  method  uses  SMI  as  the  common  guidance  for  clustering,  although  we 
are  using  two  SMI  approximators:  SMI  defined  by  Eq.(35)  for  finding  clustering  solutions 
and  LSMI  defined  by  Eq.(41)  for  selecting  models.  Since  SMI  does  not  explicitly  include 
cluster  labels  {2/,:}”=1,  it  has  a  simple  form  and  therefore  is  suited  for  efficient  maximization. 
Indeed,  we  can  obtain  an  optimal  solution  analytically  by  solving  an  eigenvalue  problem. 
However,  since  SMI  is  an  unsupervised  estimator  where  the  cluster  labels  {yi}™=1  are  not 
used,  it  may  not  be  accurate  enough  for  model  selection  purposes.  Indeed,  our  preliminary 
experiments  showed  that  the  use  of  SMI  is  not  appropriate  as  a  model  selection  criterion. 
On  the  other  hand,  since  LSMI  achieves  the  optimal  non-parametric  convergence  rate,  its 
high  accuracy  is  suitable  for  model  selection  purposes.  However,  LSMI  explicitly  requires 
cluster  labels  {?/i }”=1  and  thus  is  not  suited  for  efficient  maximization.  Based  on  the 
optimality  of  LSMI,  we  ideally  want  to  use  LSMI  consistently  for  both  finding  clustering 
solutions  and  selecting  models.  However,  its  optimization  involves  discrete  optimization 
of  {yi}1=n  which  is  cumbersome  in  practice.  Our  future  challenge  is  to  develop  a  practical 
clustering  algorithm  based  directly  on  LSMI  or  alternative  information  measures. 
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5.6  Proof  of  Theorem  2 

For  the  kernel  matrix  K ,  the  optimal  value  v  can  be  expressed  as 


where  7ii  <  ■  ■  ■  <  nc  are  class-prior  probabilities,  Ai(.K’)  >  •  •  •  >  XC(K)  >  0  are  eigen¬ 
values  of  K,  and  the  solutions  <j>i, . . . ,  4>c  are  given  by  the  eigenvectors  associated  with 
Ai (K), . . . ,  XC(K).  The  optimal  value  v'  and  solutions  cj)'c  for  K'  can  be  charac¬ 

terized  similarly.  Then  we  have 

i— 'i  = 

y=  i  y 

=  T  J2  ^  (A ,(K)  +  A y(K'))  (A y(K)  -  \(K')) 

y= i  v 

W  £  ^  (\W  +  Ay(K'))  |A„(/f)  -  As(/f')l 

y= i  y 

j/=l  y 

1  j/=l 

=  ^S(tr(Jf)  +  tr(Jf,)) 

=  ||^||Frob/7rl, 

where,  in  the  third  line,  we  used  \Xy(K)  —  Xy(K')\  <  ||  A||Frob  implied  by  Eqs.(42)  and  (43), 
and  we  used  in  the  last  line  tr (K)  =  tr(K')  =  n  implied  by  the  assumption  K(x,x)  =  1 
for  all  x.  Thus,  Eq.(45)  was  proved. 

Eq.(46)  is  immediately  implied  by  Eq.(44).  More  specifically,  <py  needs  to  be  carefully 
chosen  from  the  corresponding  eigenspace  of  K '  by  minimizing  the  angle  between  4>’y  and 
( py  (i.e. ,  maximizing  (f)y(f)y),  since  the  optimal  solution  to  SMIC  is  not  necessarily  unique. 
However,  if  (fi’y  is  set  to  be  the  eigenvector  associated  to  eigenvalue  fij  with  multiplicity 
one,  we  only  need  to  determine  its  sign.  □ 

5.7  Proof  of  Theorem  3 

We  use  the  following  two  lemmas  in  the  proof  of  Theorem  3: 

Lemma  4.  For  a.,/3  G  W1,  we  have 

\\a:-(3\\l  >  ||ck+  —  /3+|||  +  ||ck_  —  /3~||1, 
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where  a+  =  max(On,  a)  and  ex  =  min(Ore,  cx),  and  max  and  min  for  vectors  are  computed 
in  element-wise  manners. 

Proof.  Denote  by  cq  and  /3j  the  i-th  components  of  cx  and  /3,  respectively.  Then,  for  all 
i,  we  have 

(«<  -  A)2  =  K+  -  A+  )2  +  -  A7 )2,  if  >  o, 

(«t  -  A)2  >  K+  -  A+  )2  +  K7  -  A7 )A  if  a<A  <  0, 

which  complete  the  proof.  □ 

Lemma  5.  For  cx,  (3  E  we  have 

\\cx(3t  -  {3cxt ||Frob  <  v/2||«||2||/3||2. 

Proof.  By  dehnition, 

II cx(3t  -  (3cxT |||rob  =  tr((«/3T  -  f3cxT)T  (cxf3T  -  (3cxT )) 

=  tr((/3«T  —  cx/3t)(cx/3t  —  /3cxt )) 

=  tr  (/3  cxT  cx/3t  —  {3cxt  (3cxt  —  cx/3Tcx/3T  +  cx/3T  f3cxT) 

=  ||  ck  |||tr(/3/3T)  —  a1  f3  tr  ((3  a.T)  —  /3T  CKtr(o;/3T)  +  ||/3|||tr(cKCKT) 

=  ||«||^||/3||2-(aT/3)2-(/3TCK)2  +  ||/3||2||«||2 

<2||a||2||/3||2. 

The  lemma  follows  by  taking  square  roots  of  the  beginning  and  the  end  of  the  above  chain 
of  equations.  □ 

Using  the  above  lemmas,  we  prove  Theorem  3.  First  of  all,  we  have 

,  =  ^,(0+0;+T  -  0A 

Jy  Jy  w-w 

Since 

ll<A/l|i  =  ll<Aj1|i  +  ll<A/l|i,  ln<t>v  =  II^JIIi  -  ll<A/l|i,  andl702/>O, 

we  can  know  that  ||</>7 111  >  ||<AI|i/2,  and 

=  WJIIi  >  HtfJi/2  >  ||0,||j/2  =  1/2. 

Similarly,  >1/2  and  thus  it  turns  out  that 

(l70+-l70'+)>l/4. 

Next,  let  cx  =  </>+  and  (3  =  4>'y  —  <f>y.  Then  it  holds  that 

<t>y<t>'yT  -  <Ay+<A[T  =  «(«  +  PV  -  («  +  /3)«T  =  «/3T  -  (3cxT . 
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Tabic  4:  Notation  for  Rand  index  and  adjusted  Rand  index. 


Consequently,  we  have 

Wfy-fyh  <  4^11  («/3t  -  (3aT)ln\\2 

<  47rJln||2||«/3T  -  /3«t||2 

<  4:^iTy\\atf3T  -  f3(XT ||Frob, 

where  ||  •  ||2  on  Mnxn  means  the  operator  norm  induced  by  ||  •  ||2  on  Mn,  and  the  last  line 
is  due  to  the  fact  that  ||  •  ||2  is  the  0-norm  of  the  spectra  and  ||  •  1 1 Frob  is  the  f2-norm  of 
the  spectra.  According  to  Lemma  5,  it  holds  that 

II  fy  -  fyh  <  407^72110:11211/3112 

=  4  A/207Ty  1 1  07  1 1 2  1 1 -  4>yh 

<  47^70  <0||  2  ||  <0  -  <t>y\\ 2 

<  16V2nny\\A\\Froh/dr, 

where  the  third  line  is  due  to  Lemma  4,  and  we  used  in  the  last  line  the  facts  that  (py  is 
an  eigenvector  of  K  and  <0  satisfies  Eq.(46).  Finally,  dividing  the  above  inequality  by 
\pn  completes  the  proof.  □ 

5.8  Rand  Index  and  Adjusted  Rand  Index 

Here,  we  review  the  definitions  of  the  Rand  index  (RI)  [135]  and  the  adjusted  Rand  index 
(ARI)  [74],  which  are  used  for  evaluating  the  quality  of  clustering  results.  Let  {y* }{=1  be 
the  ground-truth  cluster  assignments,  and  let  {07 1  be  a  clustering  solution  obtained  by 
some  algorithm.  The  goal  is  to  quantitatively  evaluate  the  similarity  between  {00  and 

The  most  direct  way  to  evaluate  the  discrepancy  between  {0f=i  and  {y*  }”=1  would  be 
to  naively  verify  the  correctness  of  the  predicted  labels.  However,  in  clustering,  predicted 
class  labels  {00  do  not  have  to  be  equal  to  the  true  labels  {r/*}"=1,  but  only  their 
partition  matters.  The  correctness  of  the  partition  may  be  evaluated  by  verifying  the 
correctness  of  the  predicted  labels  for  all  possible  label  permutations.  However,  this  is 
computationally  expensive  if  the  number  of  classes  is  large.  RI  and  ARI  are  alternative 
performance  measures  that  can  overcome  this  computational  problem  in  a  systematic  way. 
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For  the  two  partitions  {y«}"=1  and  (y* }”=i ,  let  C.y  and  C*  (y  =  l,...,c)  be  sets  of 
indices  of  samples  in  cluster  y,  respectively: 


Cy  {Vi  |  Vi  y}, 

c;  =  [y*  I  »,*  =  y}- 

Let  be  the  number  of  samples  that  are  assigned  to  the  cluster  Cy  and  the  cluster  C*,. 
Let  riy  (resp.  n *)  be  the  number  of  samples  that  are  assigned  to  the  cluster  Cy  (resp.  C*,). 
The  notation  is  summarized  in  Table  4(a). 

Let  mcfi *,  mc  c*  i  mCc*i  ancl  mc,c *  be  defined  as 


mc,c* 


mc,c* 


—  mc,c* , 


mc,c* 


mC,C * 


mc,c* , 


~~  mC,C*  —  mC,C*  —  mC,C *  1 


where  mcc*  denotes  the  number  of  pairs  of  samples  that  are  assigned  to  the  same  cluster 
both  in  {Cy]cy=i  and  {C*,} y=1,  mC£*  denotes  the  number  of  pairs  of  samples  that  are 
assigned  to  the  same  cluster  in  {Cy}y=1  but  are  assigned  to  different  clusters  in  {C*,}y=1, 
mc,c*  denotes  the  number  of  pairs  of  samples  that  are  assigned  to  the  same  cluster  in 
{C*,}y,=1  but  are  assigned  to  different  clusters  in  {Cy}y=1,  and  denotes  the  number 

of  pairs  of  samples  that  are  assigned  to  different  clusters  both  in  {Cy}cy=\  and  {C*,}y/=1. 
mc£*  +  nice*  can  be  considered  as  the  number  of  “agreements”  between  {Cy}y=1  and 
{C*y,}cyl=v  while  mc  c *  +  mc,c*  can  be  regarded  as  the  number  of  “disagreements”  between 
{Cy}cy=i  and  {C*,}Cy'=i-  The  notation  is  summarized  in  Table  4(b). 

The  Rand  index  (RI)  [135]  is  defined  and  expressed  as 

RI  ,= _ mc£*  +  mc£* _ 

mc,c*  +  mcc*  +  mc£*  +  mc£* 

=  (mc,c*  +  mcc* )  I  Q  • 


The  Rand  index  lies  between  0  and  1,  and  takes  1  if  the  two  clustering  solutions  {Cy}y=1 
and  {C*,}y,=1  agree  with  each  other  perfectly. 

A  potential  drawback  of  the  Rand  index  is  that  its  expected  value  is  not  a  constant 
(say,  0)  if  two  clustering  solutions  are  completely  random.  To  overcome  this  problem,  the 
adjusted  Rand  index  (ARI)  was  proposed  [74],  ARI  is  defined  as 

_ _ mg£*  +  mc£*  —  y _ 

mC£*  +  rnc  c*  +  nicy*  +  nicy*  -  y ' 
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/i  is  the  expected  value  of  m<yc*  +  mc,c*  '■ 

fi:=E  [mcfi*  +  mcfi*\  , 

where  E  denotes  the  expectation  over  cluster  assignments.  ARI  takes  the  maximum  value 
1  when  two  sets  of  cluster  assignments  are  identical,  and  takes  0  if  the  index  equals  its 
expected  value. 

Under  the  assumption  that  the  clustering  solutions  {CyYy=\  and  {C*,}y,=1  are  randomly 
drawn  from  a  generalized  hyper-geometric  distribution,  it  holds  that 


E  [■ mc,c •]  =  (■ mc,c *  +  mC£*)('mc,c*  +  m^.) /  Q)  , 

E  [nice*]  =  (mc,c*  +  mc,c*)(rnc,c*  +  mc,c *) 

Then  ARI  can  be  expressed  as 


ARI 


Note  that  RI  and  ARI  can  be  defined  even  when  two  sets  of  cluster  assignments  {y,:}”=i 
and  {y*}i=\  have  different  numbers  of  clusters,  i.e.,  {Cy}y=1  and  {C*,]y=1  with  c  Y  c' ■  This 
is  highly  convenient  in  practice  since,  when  the  number  of  true  clusters  is  large,  clustering 
algorithms  often  produce  clustering  solutions  with  a  smaller  number  of  clusters  (i.e.,  some 
of  the  clusters  have  no  samples).  Even  in  such  cases,  RI  and  ARI  can  still  be  used  for 
evaluating  the  quality  of  clustering  solutions. 
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